Large language models and the rise of the AI code generators - Brain Sonic (2023)

When I wrote about GitHub Copilot in November 2021, Copilot was one of only a handful of AI code generation technologies available. I tested it as a Visual Studio Code extension. At the time, Copilot didn't always generate good, correct, or even running code, but it was still somewhat useful. The big promise behind Copilot (and other code generators that use machine learning) is that it is designed to improve over time, both by incorporating user feedback and by ingesting new code examples into its training corpus.

As of May 2023, there are hundreds of "AI" or "code generation" extensions available for Visual Studio Code alone. Several of these can save you some time while coding, but if you believe their generated code without reviewing, testing and debugging it, I have a bridge to sell you.


A promising development in this area is that several tools can automatically generate unit tests. Generating unit tests is a much more manageable problem than generating general-purpose code (in fact, it can often be done with simple patterns), but you'll still need to review and run the generated tests to see whether they make sense.
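
To make that concrete, here is the kind of test such a tool typically produces. The `add` function and both test cases are hypothetical illustrations written for pytest, not the output of any specific product.

```python
# Hypothetical function under test.
def add(a: int, b: int) -> int:
    return a + b

# The sort of boilerplate a pattern-based test generator tends to emit:
# one happy-path case and one edge case, written for pytest.
def test_add_positive_numbers():
    assert add(2, 3) == 5

def test_add_with_zero():
    assert add(0, 0) == 0
```

Even for tests this trivial, it's worth running them and asking whether the asserted values actually reflect the intended behavior.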

In the remainder of this article, I will provide a brief history of language models before examining the advanced large language models (LLMs), such as OpenAI's GPT family and Google's LaMDA and PaLM, used for text generation and code generation today. We finish with a quick tour of 10 code generation tools, including Amazon CodeWhisperer, Google Bard, and GitHub Copilot X.

A brief history of AI models for text generation

Language models date back to Andrey Markov in 1913. That field of study is now called Markov chains, a special case of Markov models. Markov showed that in Russian, specifically in Pushkin's Eugene Onegin, the probability of a letter appearing depends on the preceding letter, and that consonants and vowels generally tend to alternate. Markov's methods have since been generalized to words, to other languages, and to other language applications.

Markov's work was extended by Claude Shannon in 1948 to communication theory, and again by Fred Jelinek and Robert Mercer of IBM in 1985 to produce a language model based on cross-validation (which they called deleted estimates) and applied to real-time, large-vocabulary speech recognition. Essentially, a statistical language model assigns probabilities to sequences of words.
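
As a minimal sketch of that idea (a toy illustration, not Markov's or Jelinek's actual method), a first-order word model can be estimated by counting adjacent word pairs in a corpus; the tiny corpus below is invented for the example.

```python
from collections import Counter

# Toy corpus; a real model would be estimated from millions of sentences.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count word pairs and single words to estimate P(next word | current word).
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def prob(current, nxt):
    """Maximum-likelihood estimate of P(nxt | current)."""
    return bigrams[(current, nxt)] / unigrams[current] if unigrams[current] else 0.0

def sequence_prob(words):
    """Probability the model assigns to a word sequence (product of bigram probabilities)."""
    p = 1.0
    for cur, nxt in zip(words, words[1:]):
        p *= prob(cur, nxt)
    return p

print(prob("the", "cat"))                    # 0.25 for this toy corpus
print(sequence_prob(["the", "cat", "sat"]))  # probability assigned to the short sequence
```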

To quickly see a language model in action, type a few words into Google Search or a texting app on your smartphone and let it offer auto-completion.

In 2000, Yoshua Bengio et al. published a paper on a neural probabilistic language model, in which a neural network replaces the count-based probabilities of a statistical language model, avoiding the curse of dimensionality and improving word predictions (based on previous words) over a smoothed trigram model (then state of the art) by 20% to 35%. The idea of feed-forward, autoregressive neural network models of language is still used today, although the models now have billions of parameters and are trained on extensive corpora, hence the term "large language models."
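
A heavily simplified sketch of the feed-forward, autoregressive idea follows: embed a fixed window of previous words, pass it through a hidden layer, and predict the next word. The vocabulary size and layer widths here are arbitrary, and this is not Bengio et al.'s exact architecture.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Predict the next word from a fixed window of previous words (Bengio-style sketch)."""
    def __init__(self, vocab_size=1000, context=3, embed_dim=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context * embed_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):             # context_ids: (batch, context)
        x = self.embed(context_ids).flatten(1)  # concatenate the context word embeddings
        return self.out(torch.tanh(self.hidden(x)))  # logits over the whole vocabulary

model = FeedForwardLM()
logits = model(torch.randint(0, 1000, (4, 3)))  # 4 examples, 3-word contexts
print(logits.shape)                             # torch.Size([4, 1000])
```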

As we will see, language models have continued to grow larger over time to make them perform better. However, there are costs to this growth. The 2021 paper On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? by Emily Bender, Timnit Gebru, et al. questions whether we are going too far with this trend. Among other things, the authors suggest that we should first weigh the environmental and financial costs, and invest resources in curating and carefully documenting datasets rather than ingesting everything on the web.

Gebru and co-author Margaret Mitchell subsequently lost their jobs at Google for essentially pointing out that the emperor has no clothes. Bender is a professor at the University of Washington. Gebru went on to found the Distributed AI Research Institute.

Large language models for text generation

The recent explosion of large language models was sparked by the 2017 paper Attention Is All You Need by Ashish Vaswani et al. of Google Brain and Google Research. That paper introduced "a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely." Transformer models are both simpler than and superior to recurrent and convolutional models. They also require significantly less time to train.
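
At the heart of the Transformer is scaled dot-product attention. A minimal NumPy sketch of that single operation (ignoring multi-head projections, masking, and the rest of the architecture) looks roughly like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# Toy example: 4 token positions, 8-dimensional representations.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```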

ELMo

ELMo is a 2018 deep contextualized word representation from AllenNLP (see ELMo paper) that models both complex characteristics of word use (e.g. syntax and semantics) and how these uses vary across linguistic contexts (i.e. modeling polysemy). The original model has 93.6 million parameters and was trained on the One Billion Word Benchmark.

BERT

BERT is a 2018 language model from Google AI Language based on the company's 2017 Transformer neural network architecture (see the BERT paper). BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The two model sizes used in the original paper were 110 million and 340 million total parameters. BERT uses masked language modeling (MLM), in which roughly 15% of input tokens are masked out during training and the model learns to predict them. It was trained on English Wikipedia plus the Toronto Book Corpus.
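
To see masked language modeling in action, the Hugging Face transformers library exposes a fill-mask pipeline over a pretrained BERT checkpoint; this sketch assumes the transformers package is installed and the bert-base-uncased weights can be downloaded.

```python
from transformers import pipeline

# Fill in a masked token with a pretrained BERT model (downloads weights on first run).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    # Each prediction carries the filled-in token and the model's score for it.
    print(prediction["token_str"], round(prediction["score"], 3))
```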

T5

The 2020 Text-To-Text Transfer Transformer (T5) model from Google (see the T5 paper) synthesizes a new model based on the best transfer-learning techniques from GPT, ULMFiT, ELMo, BERT, and their successors, using a new open source pre-training dataset called the Colossal Clean Crawled Corpus (C4). The standard C4 for English is an 800 GB dataset based on the Common Crawl dataset. T5 reframes all natural language processing tasks into a unified text-to-text format, where input and output are always text strings, unlike BERT-style models that output only a class label or a span of the input. The base T5 model has about 220 million total parameters.
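
The text-to-text framing means every task is posed as string in, string out, with a task prefix selecting the behavior. A small sketch using the Hugging Face t5-small checkpoint (assuming the transformers and sentencepiece packages are installed) shows the same model handling translation and summarization:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is text in, text out; the prefix tells T5 which task to perform.
for prompt in [
    "translate English to German: The house is wonderful.",
    "summarize: Large language models assign probabilities to sequences of tokens "
    "and can be fine-tuned for tasks such as translation and code generation.",
]:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```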

GPT family

OpenAI, an AI research and deployment company, has a mission "to ensure that artificial general intelligence (AGI) benefits all of humanity." Of course, OpenAI hasn't achieved AGI yet. And some AI researchers, such as machine learning pioneer Yann LeCun of Meta-FAIR, believe that OpenAI's current approach to AGI is a dead end.

OpenAI is responsible for the GPT family of language models, which are available through the OpenAI API and Microsoft's Azure OpenAI Service. Note that the entire GPT family is based on Google's 2017 Transformer neural network architecture, which is legitimate because Google open sourced Transformer.

GPT (Generative Pretrained Transformer) is a 2018 model from OpenAI with about 117 million parameters (see the GPT paper). GPT is a unidirectional transformer that was pretrained on the Toronto Book Corpus and trained with a causal language modeling (CLM) objective, meaning that it learns to predict the next token in a sequence.


GPT-2 is a 2019 direct scale-up of GPT to 1.5 billion parameters, trained on a dataset of eight million web pages, or about 40 GB of text data. OpenAI initially restricted access to GPT-2 because it was "too good" and would lead to "fake news." The company eventually relented, although the potential social problems became even worse with the release of GPT-3.
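
Because GPT-2's weights were ultimately released, it is easy to watch causal language modeling, i.e. repeated next-token prediction, at work. This sketch assumes the Hugging Face transformers package and the gpt2 checkpoint are available.

```python
from transformers import pipeline, set_seed

# GPT-2 simply keeps predicting the next token given everything generated so far.
generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # make the sampled continuation repeatable

result = generator("A large language model is", max_new_tokens=30, num_return_sequences=1)
print(result[0]["generated_text"])
```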

GPT-3 is a 2020 autoregressive language model with 175 billion parameters, trained on a combination of a filtered version of Common Crawl, WebText2, Books1, Books2 and English Wikipedia (see GPT-3 paper). The neural network used in GPT-3 is similar to that of GPT-2, with a few extra blocks.

The main disadvantage of GPT-3 is that it tends to "hallucinate", in other words make up facts without any discernible basis. GPT-3.5 and GPT-4 have the same problem, albeit to a lesser extent.

Codex is a 2021 descendant of GPT-3 that was fine-tuned for code generation on 54 million open source GitHub repositories. This is the model used in GitHub Copilot, which I discuss in the next section.

GPT-3.5 is a set of 2022 updates to GPT-3 and Codex. The gpt-3.5-turbo model is optimized for chat, but also works well for traditional completion tasks.
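
As a sketch of how gpt-3.5-turbo is typically driven through the chat endpoint (using the 2023-era openai Python package and assuming an API key is set in the OPENAI_API_KEY environment variable):

```python
import openai  # 2023-era openai package; reads OPENAI_API_KEY from the environment

# The chat models take a list of role-tagged messages and return an assistant message.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python one-liner that reverses a string."},
    ],
)

print(response["choices"][0]["message"]["content"])
```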

GPT-4 is a 2023 large multimodal model (accepting image and text input and emitting text output) that OpenAI claims exhibits human-level performance on various professional and academic benchmarks. GPT-4 outperformed GPT-3.5 on a variety of simulated exams, including the Uniform Bar Exam, the LSAT, the GRE, and several AP subject exams.

It is of serious concern that OpenAI has not explained how GPT-4 was trained; the company says it's for competitive reasons, which makes sense given the competition between Microsoft (which has funded OpenAI) and Google. Still, not knowing the biases in the training corpus means that we don't know the biases in the model. Emily Bender's take on GPT-4 (posted on Mastodon on March 16, 2023) is that "GPT-4 should be assumed to be toxic garbage until and unless #OpenAI is *open* about its training data, model architecture, etc."

ChatGPT and BingGPT are chatbots originally based on gpt-3.5-turbo and upgraded to GPT-4 in March 2023. Currently, you need to subscribe to ChatGPT Plus to access the version of ChatGPT based on GPT-4. The standard ChatGPT, based on GPT-3.5, was trained on data with a cutoff of September 2021. BingGPT, which you can access in the Microsoft Edge browser, was also trained on data with a 2021 cutoff, but says (when you ask it) that "I am constantly learning and updating my knowledge with new information from the web."


In early March 2023, Pascale Fung from the Center for Artificial Intelligence Research at the Hong Kong University of Science & Technology gave a lecture on ChatGPT evaluation. It's worth spending an hour watching it.

LaMDA

LaMDA (Language Model for Dialogue Applications), Google's 2021 "breakthrough" conversational technology, is a Transformer-based model trained on dialogue and fine-tuned to significantly improve the sensibleness and specificity of its responses. One of LaMDA's strengths is that it can handle the topic drift that is common in human conversations.

A version of LaMDA powers Bard, Google's conversational AI service. Bard was released on March 21, 2023 and made generally available on May 10, 2023. I discuss its code generation capabilities below.

PaLM

PaLM (Pathways Language Model) is a 2022 dense decoder-only Transformer model from Google Research with 540 billion parameters, trained with the Pathways system (see the PaLM paper). PaLM was trained using a combination of English and multilingual datasets that include high-quality web documents, books, Wikipedia, conversations, and GitHub code.

Google also created a "lossless" vocabulary for PaLM that preserves all whitespace (especially important for code), splits non-vocabulary Unicode characters into bytes, and splits numbers into individual tokens, one for each digit. PaLM-Coder is a version of PaLM 540B fine-tuned on a Python-only code dataset.
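
To illustrate the digit-splitting point (with a toy splitter, not PaLM's actual SentencePiece vocabulary): giving each digit its own token keeps the structure of numbers visible to the model.

```python
def toy_tokenize(text):
    """Toy illustration: words stay whole, but every digit becomes its own token."""
    tokens = []
    for word in text.split():
        if word.isdigit():
            tokens.extend(list(word))  # "2023" -> ["2", "0", "2", "3"]
        else:
            tokens.append(word)
    return tokens

print(toy_tokenize("released in 2023 with 540 billion parameters"))
# ['released', 'in', '2', '0', '2', '3', 'with', '5', '4', '0', 'billion', 'parameters']
```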

PaLM-E

PaLM-E is a 2023 "embodied" (for robotics) multimodal language model from Google. The researchers began with PaLM, a powerful large language model, and embodied it (the "E" in PaLM-E) by augmenting it with sensor data from the robotic agent. PaLM-E is also a generally capable vision-and-language model. In addition to PaLM, it incorporates the ViT-22B vision model.

LLaMA

LLaMA (Large Language Model Meta AI) is a "raw" large language model of up to 65 billion parameters released by Meta AI (aka Meta-FAIR) in February 2023. According to Meta, training smaller foundation models like LLaMA is desirable in the large language model space because it requires far less computing power and resources to test new approaches, validate others' work, and explore new use cases; foundation models train on a large set of unlabeled data, which makes them ideal for fine-tuning for a variety of tasks.

LLaMA was released in several sizes, along with a model card detailing how the model was built. Originally you had to request the checkpoints and tokenizer, but they are in the wild now: a downloadable torrent was posted on 4chan by someone who had legitimately obtained the models by submitting a request, according to Yann LeCun of Meta-FAIR.

Specialized code generation products

While several major language models, including those behind ChatGPT and Bard, can be used for code generation as released, it helps to fine-tune them on some code, typically from free open source software, to avoid blatant copyright infringement. That still raises the specter of "open source software piracy," which is the allegation of a 2022 federal class-action lawsuit against GitHub, Microsoft (the owner of GitHub), and OpenAI over the GitHub Copilot product and the OpenAI GPT Codex model.

Note that in addition to using AI models trained largely on publicly available code, some code generation tools rely on searching code sharing sites such as Stack Overflow.

Amazon CodeWhisperer

Amazon CodeWhisperer integrates with Visual Studio Code and JetBrains IDEs, generates code suggestions in response to comments and code completions based on existing code, and can scan code for security issues. You can also enable CodeWhisperer for use in AWS Cloud9 and AWS Lambda.

CodeWhisperer supports the Python, Java, JavaScript, TypeScript, and C# programming languages well, and 10 more programming languages to a lesser extent. It is free for individual developers and costs $19 per user per month for professional teams.

CodeWhisperer helped me write the Python code shown below. I've reviewed, tested and debugged it and it's fine.

(Screenshot: Python code generated with help from Amazon CodeWhisperer. IDG)

Bard

Bard programming support was announced on April 21, 2023. The announcement notes support for more than 20 programming languages, including C++, Go, Java, JavaScript, TypeScript, and Python. As a quick test, I asked Bard to "write a Go function to return the current date and time." It did so quickly:

(Screenshot: the Go function Bard generated, along with its explanation and an example call. IDG)

Not only did Bard write the function, it also explained the function and generated an example of calling the function.
