Large Language Models (LLMs) are rapidly changing the way we interact with technology and shaping the future of the global economy. According to McKinsey, generative AI, of which LLMs form the backbone, has the potential to add up to $4.4 trillion to the global economy annually. These powerful models can produce human-like text, translate between languages accurately, and create many kinds of content with impressive skill.
However, building an LLM from scratch is no easy feat. It involves numerous technical challenges, from gathering and preparing data to training and fine-tuning the model, and it demands not only technical expertise but also perseverance and creativity. For the researchers and engineers ready to take on this complex task, this guide is a handbook for navigating the challenges of building an LLM from the ground up.
1. Getting the data right takes time
LLMs consume vast amounts of data, and high-quality multilingual data is scarce, so building a multi-stage data pipeline takes time. Data lineage tracking tools help teams understand where data came from and how it has changed, which is essential for quality and reproducibility. It is equally important to track the different versions of a dataset produced at each preprocessing step; data versioning tools such as DVC help maintain consistency and manage updates.
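As a concrete illustration, here is a minimal sketch of pinning a preprocessing step to an exact data version with DVC's Python API; the repository URL, file path and tag name are hypothetical placeholders, not a prescribed setup.

```python
# Minimal sketch: pin a preprocessing run to an exact, tagged data version with DVC.
# The repo URL, file path and tag below are hypothetical placeholders.
import dvc.api

# Read the corpus as it existed at a specific tagged revision of the data repo,
# so the preprocessing step stays reproducible even after the dataset is updated.
raw_text = dvc.api.read(
    path="data/corpus/train.txt",               # hypothetical path tracked by DVC
    repo="https://example.com/llm-data.git",    # hypothetical data repository
    rev="corpus-v1.2",                          # git tag marking the data version
)

print(f"Loaded {len(raw_text)} characters from corpus-v1.2")
```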
Data pipelines transform raw data into formats better suited for processing. Versioning the pipeline recipes themselves lets teams experiment with different approaches on existing or new datasets and revert to an earlier recipe when a new one doesn't work out. Open-source tools like Spark let teams scale data processing across large numbers of machines.
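As a rough sketch of what one such pipeline stage might look like with PySpark, the snippet below drops very short fragments and normalises text across a cluster; the bucket paths and length threshold are illustrative assumptions, not a prescribed recipe.

```python
# Minimal sketch, assuming PySpark is available: one cleaning stage of a corpus pipeline.
# Paths and thresholds are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("corpus-cleaning").getOrCreate()

# Each line of the input files becomes one row with a single `value` column.
docs = spark.read.text("s3://example-bucket/raw-corpus/*.txt")

cleaned = (
    docs
    .withColumn("value", F.trim(F.col("value")))
    .filter(F.length("value") > 200)              # drop very short fragments
    .withColumn("value", F.lower(F.col("value")))  # simple normalisation step
)

cleaned.write.mode("overwrite").text("s3://example-bucket/cleaned-corpus/")
spark.stop()
```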
Others, like Airflow and Prefect, orchestrate complex data pipelines and are essential for a robust data preparation process. Nebius’ own TractoAI is an end-to-end solution for data preparation and exploration that ties these capabilities together and helps anyone taking their first steps on this journey.
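To illustrate the orchestration piece, here is a minimal Prefect sketch that chains hypothetical ingestion, cleaning and tokenisation stages; the task bodies and storage paths are placeholders for real jobs, and an Airflow DAG could express the same flow.

```python
# Minimal sketch of orchestrating the stages above with Prefect.
# Stage bodies and paths are hypothetical placeholders for real jobs.
from prefect import flow, task

@task(retries=2)
def ingest_raw_data() -> str:
    # e.g. pull new crawl dumps into object storage
    return "s3://example-bucket/raw-corpus/"

@task
def clean_corpus(raw_path: str) -> str:
    # e.g. submit the Spark cleaning job sketched earlier
    return "s3://example-bucket/cleaned-corpus/"

@task
def tokenize_corpus(clean_path: str) -> str:
    # e.g. tokenise and shard the output for training
    return "s3://example-bucket/tokenized-corpus/"

@flow(name="llm-data-pipeline")
def data_pipeline():
    raw = ingest_raw_data()
    cleaned = clean_corpus(raw)
    tokenize_corpus(cleaned)

if __name__ == "__main__":
    data_pipeline()
```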
2. Experimentation is a given
The next step in creating an LLM is experimenting to confirm that a process which seems to work at small scale keeps working at a much larger one. Many things can go wrong when scaling up a new LLM: developers must assess data quality, validate model architectures, and work out how to distribute training across numerous machines.
Teams need to maintain detailed records for reproducibility and track how changes in the training process affect the final results; tools such as MLflow or Weights & Biases can be used at this stage. When experimenting, researchers need to focus on two key questions: does the idea work, and does it scale? With that in mind, start small, on as few as 8 GPUs, to test feasibility. If that works, scale up to 32-64 GPUs for a day to validate scalability, then to 128 or more GPUs for week-long training runs to ensure robustness.
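For example, a minimal MLflow sketch of recording one of these runs might look like the following; the experiment name, hyperparameters and loss values are illustrative, not results from a real run.

```python
# Minimal sketch of experiment tracking with MLflow; all names and values
# logged here are illustrative placeholders.
import mlflow

mlflow.set_experiment("llm-scaling-feasibility")

with mlflow.start_run(run_name="baseline-8xGPU"):
    # Record the configuration that defines this experiment
    mlflow.log_params({
        "num_gpus": 8,
        "global_batch_size": 256,
        "learning_rate": 3e-4,
        "architecture": "decoder-only-1.3B",
    })

    # Inside the real training loop you would log metrics every N steps
    for step, loss in [(100, 3.2), (200, 2.9), (300, 2.7)]:
        mlflow.log_metric("train_loss", loss, step=step)
```

The same run names and parameters can then be reused when the experiment is repeated at 32-64 and 128+ GPUs, making it easy to compare how the loss curve holds up as the run scales.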
3. Make pre-training a priority
Pre-training requires substantial computational resources, leading developers to seek out external clusters. Variations in data center architectures can lead to performance inconsistencies or failures, resulting in stability issues such as restarts, which are both time-consuming and costly.
There are many different ways to run batches of data across GPU clusters, and the options vary depending on each cloud provider’s approach. The best architectures use the NVIDIA Collective Communications Library (NCCL), which lets GPUs exchange updates directly, peer to peer. This keeps every compute node on the same page with less networking overhead. Teams should agree on a proof of concept and rigorously test cluster performance on a variety of real workloads and benchmarks (e.g. NCCL tests); if the tests pass, shortlist the most reliable providers and move to a long-term contract.
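To make the communication pattern concrete, here is a minimal PyTorch sketch of an NCCL-backed all-reduce, the collective that keeps gradient updates synchronised across ranks; it assumes the script is launched with torchrun and that each process owns one GPU.

```python
# Minimal sketch of the NCCL-backed communication pattern used during training:
# every rank holds a gradient tensor and all-reduce averages it across GPUs.
# Assumes launch via: torchrun --nproc_per_node=<gpus> this_script.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE in the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for a gradient shard produced by the backward pass
    grad = torch.randn(1024, device=f"cuda:{local_rank}")

    # Sum gradients across all ranks, then average them
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```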
4. Don’t forget to checkpoint
On large training runs, it’s important to save intermediate checkpoints, ideally every hour, in case the run crashes. This ensures you can restart from where you left off rather than losing days or weeks of work. You don’t need to keep every hourly checkpoint, but it’s a good idea to retain daily ones in case assumptions about the model architecture lead to problems such as gradient explosion and you need to roll back further.
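A minimal sketch of this hourly cadence inside a PyTorch training loop might look like the following; the model, optimiser and step counter are whatever your run already maintains, and the interval and output directory are illustrative.

```python
# Minimal sketch of periodic (roughly hourly) checkpointing in a PyTorch loop.
# The interval and output directory are illustrative placeholders.
import os
import time
import torch

CHECKPOINT_EVERY_SECONDS = 3600  # roughly hourly, as suggested above
_last_checkpoint_time = time.time()

def maybe_checkpoint(model, optimizer, step, out_dir="checkpoints"):
    """Save a resumable checkpoint if more than an hour has passed."""
    global _last_checkpoint_time
    if time.time() - _last_checkpoint_time < CHECKPOINT_EVERY_SECONDS:
        return
    os.makedirs(out_dir, exist_ok=True)
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        os.path.join(out_dir, f"step_{step}.pt"),
    )
    _last_checkpoint_time = time.time()
```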
You should also explore model and infrastructure architectures that let you back up checkpoints from RAM while training continues, so the run doesn’t pause for every save. Model sharding and different combinations of data and model parallelism can speed up the backup process. Open-source tools like JAX’s Orbax or PyTorch Lightning can help automate checkpointing. In addition, using storage that is optimised for checkpoints is key.
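As an illustration of the automation these tools provide, here is a minimal PyTorch Lightning sketch that saves a checkpoint every few thousand steps and keeps every copy; the directory, interval and device count are placeholder values, and the Lightning module and dataloader are assumed to exist already.

```python
# Minimal sketch, assuming PyTorch Lightning: a callback that checkpoints the run
# every N training steps and keeps all copies so a crashed run can be resumed.
# Directory, interval and device count are illustrative placeholders.
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(
    dirpath="checkpoints/",
    filename="{epoch}-{step}",
    every_n_train_steps=2000,   # save frequently on long runs
    save_top_k=-1,              # keep every checkpoint rather than only the "best"
)

trainer = Trainer(
    accelerator="gpu",
    devices=8,
    callbacks=[checkpoint_cb],
)
# trainer.fit(lightning_module, train_dataloader)  # your existing module and data
```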
5. Achieving alignment and optimal performance
The final stage involves further experimentation, but with a much lighter computational footprint. Tracking and benchmarking these experiments remains crucial for achieving successful alignment and optimal performance, as does relying on well-established methods that streamline the alignment process.
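As one example of such a method (not one this guide prescribes), Direct Preference Optimization (DPO) reduces alignment to a simple loss over preference pairs. The sketch below assumes you already have summed log-probabilities of chosen and rejected responses under the policy being trained and a frozen reference model.

```python
# Minimal sketch of the DPO loss over a batch of preference pairs, shown only as
# an illustration of a lightweight alignment objective; inputs are assumed to be
# per-example summed log-probabilities under the policy and a frozen reference.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more the policy prefers each response than the reference does
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Encourage a positive margin between chosen and rejected rewards
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()
```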
To conclude, building an effective LLM does not have to be an insurmountable challenge. While it requires meticulous attention and numerous steps to achieve good results for new use cases, languages, and domains, it is a task that can be achieved with a well-structured approach.
A comprehensive plan should encompass key elements such as Data Preparation, Model Validation and Experimentation, Pre-Training on Large Clusters, Implementing Checkpoints, and Ensuring Alignment. These steps are critical in developing a robust, efficient, and fair model, ultimately leading to a more reliable and impactful AI platform.