GenAI
The Era of LLM Infrastructure
Feb 12, 2024

API access to large language models has opened up a world of opportunities, and many simple proof-of-concept applications have shown real promise. However, as these applications grow in complexity, several crucial issues arise when putting them into production: unreliable API endpoints, slow token generation, LLM lock-in, and cost management. Clearly, the LLM era will require solutions to manage LLM API endpoints.

Glide is a cloud-native LLM gateway that provides a lightweight interface to manage the complexity of working with multiple LLM providers.

Architecture

Unified API

Glide offers a unified API for interacting with multiple LLM providers. Instead of dedicating considerable time and resources to developing custom integrations for each provider, applications talk to a single API interface through which any supported LLM provider can be reached. Working off a standardized API minimizes complexity and development time, leading to faster and more efficient application development. It also eliminates LLM lock-in: the underlying models can be switched without the client application being aware of the change.
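For illustration, here is a minimal sketch of what a client call might look like against a locally running Glide instance. The port (9099), the endpoint path (/v1/language/<router>/chat), the payload shape, and the router name my-chat-app (routers are covered below) are assumptions for this example; consult the Glide docs for the exact API.

import requests

# Minimal sketch of a chat request routed through Glide.
# Port, path, payload shape, and router name are assumptions -- check the docs.
GLIDE_CHAT_URL = "http://127.0.0.1:9099/v1/language/my-chat-app/chat"

resp = requests.post(
    GLIDE_CHAT_URL,
    json={"message": {"role": "user", "content": "What routing modes does Glide support?"}},
    timeout=30,
)
resp.raise_for_status()

# The same client code keeps working even if the provider or model behind
# the router is swapped out in the Glide configuration.
print(resp.json())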

Glide Routers

A fundamental concept in Glide is the router. Routers let you group models together under shared logic. Consider a RAG-powered chatbot that lets users search over a documentation set. If it is built directly on GPT-3.5 Turbo, it depends entirely on OpenAI keeping its API available, which poses a significant risk to the application and the user experience. Instead, you can set up a Glide router in resilience mode by adding a single backup model to the router. If the OpenAI API fails, Glide automatically sends the call to the next model specified in the configuration. In addition, knowledge of model failures is shared across all routers, reducing wasteful retries when an LLM provider has a known issue.

Another essential router type is the least-latency router. This router selects the model with the lowest average latency per generated token. Since we don’t know the actual distribution of model latencies, we estimate it and keep the estimate updated over time: older latency data is weighted lower and is eventually dropped from the calculation, so the estimates stay current. As with all routers, if a model becomes unhealthy, the router falls back to the next-best option, and so on.
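The exact estimation logic lives inside the gateway, but the general idea can be sketched with an exponentially weighted moving average, where recent observations count for more than old ones. This is an illustrative sketch only, not Glide's actual implementation, and the decay factor is an arbitrary assumption.

# Illustrative sketch of a decaying per-token latency estimate (not Glide's actual code).
class LatencyEstimate:
    def __init__(self, decay: float = 0.8):
        self.decay = decay   # weight kept by the existing estimate on each update
        self.value = None    # estimated average latency per generated token, in ms

    def update(self, observed_ms_per_token: float) -> None:
        if self.value is None:
            self.value = observed_ms_per_token
        else:
            # Newer observations are weighted higher, so stale latency data
            # gradually stops influencing routing decisions.
            self.value = self.decay * self.value + (1 - self.decay) * observed_ms_per_token

# A least-latency router would then pick the healthy model with the lowest estimate.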

Other routing modes are available, such as round-robin, which is excellent for A/B testing, and weighted round-robin, which lets you specify what percentage of traffic each model in a set should receive.
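As a rough illustration of that traffic split (again a sketch, not Glide's implementation, and the model names here are made up), a weighted round-robin schedule can be thought of as cycling through models in proportion to their weights:

from itertools import cycle

# Illustrative sketch only -- a 90/10 split between two hypothetical models,
# e.g. to send a small share of traffic to a challenger model during an A/B test.
schedule = cycle(["primary-model"] * 9 + ["challenger-model"])

def pick_model() -> str:
    return next(schedule)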

Because a single Glide deployment can host numerous routers, it can serve multiple applications with diverse requirements. There are also exciting routers on the roadmap, such as intelligent routing, which sends each request to the model best suited to handle it.

Declarative Configuration

Glide simplifies the setup process through declarative configuration, which defines the state of the Glide gateway in one place. This also means that secret management is centralized, enabling the rotation of API keys from a single location.

Furthermore, this approach enables the separation of responsibilities between teams. One team can manage the infrastructure, deploy Glide, and make it available to other teams (such as AI/DS teams) while also being responsible for rotating keys. Meanwhile, other teams can solely focus on working with models and not worry about these configurations.

Here is a bare-bones configuration example:

routers:
  language:
    - id: my-chat-app
      strategy: priority
      models:
        - id: primary
          openai:
            model: "gpt-3.5-turbo"
            api_key: ${env:OPENAI_API_KEY}
        - id: secondary
          azureopenai:
            api_key: ${env:AZUREOAI_API_KEY}
            model: "glide-GPT-35" # the Azure OpenAI deployment name
            base_url: "https://mydeployment.openai.azure.com/"

With this simple configuration, a priority (fallback) router has been created. All requests are sent to OpenAI first; should the OpenAI API fail, the request is sent to the Azure OpenAI deployment instead.

What’s Next?

The future of LLM applications will be multi-modal, with text, speech, and vision models employed together to create rich user experiences. Glide will be the go-to gateway for these applications. Glide plans to support various features over the next several months, including exact and semantic caching, embedding endpoints, speech endpoints, safety policies, and monitoring features.

If you are interested in using Glide, here is a list of links for you to check out:

🛠️ Github

📚 Docs

💬 Discord

🗺️ Roadmap