Deepcode

This talk explores an agent-based auto-coder designed to outperform existing AI coding agents by intelligently managing code context, critiquing its own work, and iterating for robust solutions.

Overview

Agent-based auto-coder that aims to exceed the SWE-bench performance of Devin, Code Droid, and others.
My demo is basic and buggy (https://deepcode.streamlit.app/); I hacked it out in three weeks. It currently pulls your GitHub repo, analyses the relationships between files and functions, pulls the relevant files, and then iteratively answers while critiquing itself and catching errors. It is moving quickly in the right direction, though.

Here are some of my unstructured project notes:

Better context pulling:

We create a JSON that represents the file and function structure of the user’s repo, including the relationships between them, and hand it to the LLM. It chooses n files to open.
That source code is not shown to the assistant LLM yet.

Why?

Providing more files than necessary can lead to worse performance for fixing code due to increased complexity, context overload, and noise. The impact varies (guesstimates):

Minimal Impact (10-20%): Closely related and small additional files.
Moderate Impact (20-40%): Moderate complexity and noise introduced by extra files.

Instead of showing all source code to the assistant LLM, another LLM gets to copy-paste and comment on a subset of the source code. This works extremely well and focuses the assistant far better.

Having the LLM copy and paste

Instead of returning the source code, it just returns unique segments of the source code string in between start and stop tags, and keeps comments very short. This cuts down on time.
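One possible reading of this, sketched below: the context-pulling LLM returns only a short, unique snippet marking where a relevant region starts and another marking where it stops, plus a brief comment, and ordinary code expands those markers back into the full source span. The tag names (`<start>`, `<stop>`, `<comment>`) and the function names are hypothetical, chosen for illustration; they are not taken from the demo.

```python
import re

# Hypothetical reply format: <start>…</start><stop>…</stop><comment>…</comment>
SELECTION = re.compile(
    r"<start>(.*?)</start>\s*<stop>(.*?)</stop>\s*<comment>(.*?)</comment>",
    re.DOTALL,
)


def extract_segment(source: str, start: str, stop: str) -> str:
    """Return the span of `source` from `start` through the next `stop`."""
    # The start marker must be unique, otherwise the span is ambiguous.
    if source.count(start) != 1:
        raise ValueError(f"start marker must occur exactly once: {start!r}")
    begin = source.index(start)
    end = source.find(stop, begin + len(start))
    if end == -1:
        raise ValueError(f"stop marker not found after start: {stop!r}")
    return source[begin:end + len(stop)]


def expand_selections(source: str, reply: str) -> list[tuple[str, str]]:
    """Turn the LLM's short reply into (comment, full source span) pairs."""
    return [
        (comment.strip(), extract_segment(source, start, stop))
        for start, stop, comment in SELECTION.findall(reply)
    ]
```

The LLM only has to emit a few tokens per region, and the copied code is guaranteed to be verbatim because it is sliced from the original string rather than regenerated.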

Won’t LLMs just get ever-increasing context windows?

Yes. Tokens will get cheaper too. However, proper outside context management currently de-noises the input far better than LLM-internal attention mechanisms. At any given level of intelligence of the LLM we’re plugging in, that makes the system smarter and able to handle larger and more complex codebases.

At some point AGI will do everything, and before then, LLMs will have vastly improved default attention mechanisms. Until then my approach appears to be advantageous.

Additional context pull

Occasionally a vector search, e.g. for a variable name, would do better than the structural lookup. We can do both.
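A minimal sketch of what such a lookup could look like, using only the standard library. Character-trigram count vectors stand in here for learned embeddings, which a real system would more likely use; the function names and the `chunks` layout (a name mapped to a piece of source text) are assumptions for illustration.

```python
import math
from collections import Counter


def trigram_vector(text: str) -> Counter:
    """Represent text as counts of its lowercase character trigrams."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, chunks: dict[str, str], top_k: int = 3) -> list[str]:
    """Return the names of the chunks most similar to the query."""
    q = trigram_vector(query)
    scored = [(cosine(q, trigram_vector(text)), name) for name, text in chunks.items()]
    scored.sort(reverse=True)
    # Drop chunks that share nothing with the query.
    return [name for score, name in scored[:top_k] if score > 0]
```

The results of a search like this can simply be merged with the files chosen from the repo JSON before the copy-paste step.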

Improving context pulling

Right now the JSON is generated near instantly by a hard-coded algorithm that analyses which functions call which other ones, what the input and output arguments are called, etc. Repos for code that actually does something useful tend to have at least decently descriptive function names.
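As an illustration of how such a map can be built without any LLM call, here is a small sketch for Python sources using the standard `ast` module. It is not the algorithm used in the demo; the output layout (`args`, `calls`) and the function names `map_module` and `map_repo` are assumptions.

```python
import ast
import json


def map_module(source: str) -> dict:
    """Map each top-level function to its argument names and the names it calls."""
    tree = ast.parse(source)
    functions = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Only plain-name calls like foo(...) are recorded; method calls
            # such as obj.foo(...) are skipped to keep the sketch short.
            calls = sorted({
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            })
            functions[node.name] = {
                "args": [a.arg for a in node.args.args],
                "calls": calls,
            }
    return functions


def map_repo(files: dict[str, str]) -> str:
    """Build the JSON handed to the LLM from {path: source} pairs."""
    return json.dumps({path: map_module(src) for path, src in files.items()}, indent=2)
```

Because this is pure parsing, it runs in well under a second even on sizeable repos, which is what makes it practical to regenerate the map on every request.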

Refactoring a project with 1 million lines of code: LLM re-processing in varying abstraction levels

Based on the initial JSON, we can let LLMs look at source files and write their own comments on what each function does. After this, the LLM can “zoom out” and write a comment at a higher level of abstraction. Zooming in and out, just like a human would, while taking notes, will eventually give the LLM a tremendous understanding at many different abstraction levels. This is crucial for performing well when implementing complex new features that span several modules.
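The zooming can be sketched as a bottom-up pass: notes on functions feed notes on files, which feed a note on the whole project. In the sketch below `summarize` is a placeholder for whatever LLM call is used (it is passed in, not a real API), and the three fixed levels are an assumption; a large codebase would likely need more.

```python
from typing import Callable


def summarize_levels(
    functions: dict[str, dict[str, str]],
    summarize: Callable[[str], str],
) -> dict:
    """Build notes at function, file, and project level.

    `functions` maps file path -> {function name: source code}.
    `summarize` is a stand-in for an LLM call that turns text into a short note.
    """
    notes = {"functions": {}, "files": {}, "project": ""}
    for path, funcs in functions.items():
        # Zoom in: one note per function, written from its source.
        for name, source in funcs.items():
            notes["functions"][f"{path}::{name}"] = summarize(source)
        # Zoom out: the file note is written from the function notes only.
        file_input = "\n".join(notes["functions"][f"{path}::{n}"] for n in funcs)
        notes["files"][path] = summarize(file_input)
    # Zoom out again: the project note is written from the file notes only.
    notes["project"] = summarize("\n".join(notes["files"].values()))
    return notes
```

Each higher level only ever sees the notes from the level below, so the input to any single call stays small regardless of how large the repo is.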

Web search

In parallel to the assistant writing its first batch of code, scour the internet for up-to-date documentation and forum posts.

Iteration

Have another LLM provide some pointed critique. The critic:

Ensures that the newly generated code fulfils all the criteria of the users.
Points out how this could go wrong so that the assistant can fix it.
Ensures that the assistant LLM is not lazy (avoiding placeholders in the code).
Makes sure that the fixed code will interface nicely with existing code.
Considers functionality of the existing code that wasn’t mentioned, and makes sure that it isn’t dropped.
Enforces coding best practices.
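The generate-critique cycle described above can be sketched as a small loop. Both `generate` and `critique` are placeholders for LLM calls and are passed in as plain functions; their signatures and the `max_rounds` cap are assumptions for illustration.

```python
from typing import Callable


def iterate(
    task: str,
    generate: Callable[[str, list[str]], str],
    critique: Callable[[str, str], list[str]],
    max_rounds: int = 3,
) -> tuple[str, list[str]]:
    """Alternate between the assistant and the critic until the critic is satisfied.

    `generate(task, feedback)` returns code; `critique(task, code)` returns a
    list of issues (empty when the critic has nothing left to flag).
    Returns the final code and any issues still open.
    """
    code = generate(task, [])
    issues = critique(task, code)
    for _ in range(max_rounds):
        if not issues:
            break
        # Hand the critic's points back to the assistant for another attempt.
        code = generate(task, issues)
        issues = critique(task, code)
    return code, issues
```

Returning the unresolved issues alongside the code lets the caller decide whether a result that still has open critique is good enough to show.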

Improving context pulling (for send-and-forget code generation)

Finetune the instructor as well as the agent on the user’s codebase. At the time of writing, finetuning is not available for GPT-4o, only for GPT-3.5. For now, we could finetune GPT-3.5 and have it serve up relevant source code.

Closed-loop AI coding

Have the LLM run everything in a Docker container, and feed the errors back into the system.
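A minimal sketch of the feedback loop. To keep the example self-contained it runs the candidate in a local subprocess, which is only a stand-in for the Docker container and provides no isolation; `fix` is a placeholder for the LLM call that repairs code given an error message, and `max_runs` is an assumed cap.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_candidate(code: str, timeout: int = 30) -> tuple[int, str]:
    """Run the code in a fresh process and return (exit code, stderr)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code)
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return proc.returncode, proc.stderr


def closed_loop(
    code: str, fix: Callable[[str, str], str], max_runs: int = 3
) -> Optional[str]:
    """Run, feed errors back to `fix`, and repeat; return working code or None."""
    for attempt in range(max_runs):
        returncode, stderr = run_candidate(code)
        if returncode == 0:
            return code
        if attempt < max_runs - 1:
            # The traceback is the feedback signal for the next attempt.
            code = fix(code, stderr)
    return None
```

Only code that has actually been executed successfully is ever returned, so a fix produced on the last round is never reported as working without a run.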

Adversarial coding

One set of workers (assistant, context puller, critic, web searcher, etc.) creates new features or implements a fix. They temporarily include lots of logging.
Another set of workers independently creates unit tests.

Twin system

In cases where the full repo cannot be run (e.g. because of proprietary data), the system should just test what it can. This will reduce the confidence level of the solution and, depending on the user setting, lead to a pull-request suggestion once it passes a certain level of confidence.

Trello ticket to pull-request

Let it see tickets and have a go at them in the background.

Whenever a fix passes the unit tests generated by the AI (or previously present in the code), it sends a PR.
If it doesn’t pass all tests but judges that it has something really close to working, it can also send a PR once a certain confidence threshold is met.
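The two rules above amount to a small decision function. In this sketch the `FixReport` fields, the self-assessed `confidence` score between 0 and 1, and the default threshold of 0.9 are all hypothetical; the notes do not specify how confidence is measured.

```python
from dataclasses import dataclass


@dataclass
class FixReport:
    passed: int        # number of unit tests that passed
    total: int         # number of unit tests that were run
    confidence: float  # hypothetical self-assessed score in [0, 1]


def should_open_pr(report: FixReport, threshold: float = 0.9) -> bool:
    """Open a PR when all tests pass, or when confidence clears the threshold."""
    all_passed = report.total > 0 and report.passed == report.total
    if all_passed:
        return True
    # Near miss: not everything passes, but the system rates the fix as close.
    return report.confidence >= threshold
```

The threshold would be the natural place to hook in the user setting, for example lowering it for draft PRs and raising it for repos where only part of the code can be run.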

Ideas (apparently) already implemented by Code Droid that seem to work

Multi-resolution project context
“We have found it critical to include tools like linters, static analyzers, and debuggers”
Planning, then decomposing into more manageable subtasks
Multiple different models for variety of thought
Dynamically generates diagrams that visually map out the architecture and workflows of the codebase for better comprehension

Links

Tech stack