sysgrok

LLM-driven assistant for analyzing, understanding and optimizing systems

Apache-2.0 License


sysgrok is an experimental proof of concept, intended to demonstrate how LLMs can be used to help SWEs and SREs understand systems, debug issues, and optimise performance.

It can do things like:

  • Take the top most expensive functions and processes identified by a profiler, explain
    the functionality that each provides, and suggest optimisations.
  • Take a host and a description of a problem that host is encountering and automatically
    debug the issue and suggest remediations.
  • Take source code that has been annotated by a profiler, explain the hot paths, and
    suggest ways to improve the performance of the code.

Check out the release blog post for a longer introduction.

See the Usage section below for the full list of commands that sysgrok supports.

Here's an example using the analyzecmd sub-command, which connects to a remote host, executes one or more commands, and summarises the result. The demo shows how this can be used to automate the process described in Brendan Gregg's article, "Linux Performance Analysis in 60 seconds".

Installation

  1. Copy .env.example to .env and fill in the required variables. The GAI_API_TYPE must be either "azure" or "openai", and the GAI_API_KEY must be your API key. If you are using an Azure endpoint then you must also provide the GAI_API_BASE and GAI_API_VERSION variables. The correct values for these can be found in your Azure portal. A sketch of a filled-in .env follows these steps.

  2. Install requirements via pip

$ python -m venv venv # Create a virtual environment
$ source venv/bin/activate # Activate the virtual environment
$ pip install -r requirements.txt # Install requirements in the virtual environment
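
As an illustration, a filled-in .env might look like the following. Every value below is a placeholder; substitute your own:

# For an OpenAI endpoint
GAI_API_TYPE=openai
GAI_API_KEY=<your API key>

# For an Azure endpoint
GAI_API_TYPE=azure
GAI_API_KEY=<your API key>
GAI_API_BASE=<your endpoint URL, from the Azure portal>
GAI_API_VERSION=<your API version, from the Azure portal>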

Usage

For now, sysgrok is a command line tool and takes input either via stdin or from a file, depending on the command. Usage is as follows:

usage: ./sysgrok.py [-h] [-d] [-e] [-c] [--output-format OUTPUT_FORMAT] [-m MODEL] [--temperature TEMPERATURE] [--max-concurrent-queries MAX_CONCURRENT_QUERIES]
                    {analyzecmd,code,explainfunction,explainprocess,debughost,findfaster,stacktrace,topn} ...

                               _
 ___ _   _ ___  __ _ _ __ ___ | | __
/ __| | | / __|/ _` | '__/ _ \| |/ /
\__ \ |_| \__ \ (_| | | | (_) |   <
|___/\__, |___/\__, |_|  \___/|_|\_\
     |___/     |___/

System analysis and optimisation with LLMs

positional arguments:
  {analyzecmd,code,explainfunction,explainprocess,debughost,findfaster,stacktrace,topn}
                        The sub-command to execute
    analyzecmd          Summarise the output of a command, optionally with respect to a problem under investigation
    code                Summarise profiler-annotated code and suggest optimisations
    explainfunction     Explain what a function does and suggest optimisations
    explainprocess      Explain what a process does and suggest optimisations
    debughost           Debug an issue by executing CLI tools and interpreting the output
    findfaster          Search for faster alternatives to a provided library or program
    stacktrace          Summarise a stack trace and suggest changes to optimise the software
    topn                Summarise Top-N output from a profiler and suggest improvements

options:
  -h, --help            show this help message and exit
  -d, --debug           Debug output
  -e, --echo-input      Echo the input provided to sysgrok. Useful when input is piped in and you want to see what it is
  -c, --chat            Enable interactive chat after each LLM response
  --output-format OUTPUT_FORMAT
                        Specify the output format for the LLM to use
  -m MODEL, --model-or-deployment-id MODEL
                        The OpenAI model, or Azure deployment ID, to use.
  --temperature TEMPERATURE
                        ChatGPT temperature. See OpenAI docs.
  --max-concurrent-queries MAX_CONCURRENT_QUERIES
                        Maximum number of parallel queries to OpenAI

Feature Requests, Bugs and Suggestions

Please log them via the GitHub Issues tab. If you have specific requests or bugs then great, but I'm also happy to discuss open-ended topics, future work, and ideas.

Adding a New Command

Adding a new command is easy; a minimal skeleton is sketched after this list. You need to:

  1. Create a file, yourcommand.py, in the sysgrok/commands directory. It's
    likely easiest to just copy an existing command, e.g. stacktrace.py
  2. Your command file needs to have the following components:
    • A top-level command variable, which is the name users will use to invoke
      your command.
    • A top-level help variable describing the command, which will appear
      when the -h flag is passed.
    • An add_to_command_subparsers function, which should add a sub-parser to
      handle the command line arguments that are specific to your
      command.
    • A run function that is the interface to your command. It will be the
      function that creates the LLM queries and produces a result. It should
      return 0 upon success, or -1 otherwise.
  3. Update sysgrok.py:
    • Add your command to the imports
    • Add your command to the commands dict.
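
As promised, here is a minimal sketch of a command file. The command, help, add_to_command_subparsers and run names are the ones the steps above require; the sub-parser argument and the exact run signature are assumptions, so copy the real ones from an existing command such as stacktrace.py:

# sysgrok/commands/yourcommand.py -- hypothetical skeleton

command = "yourcommand"
help = "One-line description, shown in the -h output"


def add_to_command_subparsers(subparsers):
    # Register a sub-parser for the arguments specific to this command
    parser = subparsers.add_parser(command, help=help)
    parser.add_argument("infile", nargs="?", help="Hypothetical input file argument")


def run(args_parser, args):
    # Build the LLM queries from args and produce the result here.
    # Return 0 upon success, or -1 otherwise.
    return 0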

Examples

Note 1: The output of sysgrok is heavily dependent on the input prompts, and I am still experimenting with them. This means that the output you get from running sysgrok may differ from what you see below. It may also vary based on minor differences in your input format, or the usual quirks of LLMs. If you have an example where the output you get from a command is incorrect or otherwise not helpful, please file an issue and let me know. If I can figure out how to get sysgrok to act more usefully, I will.

Note 2: The output of sysgrok is also heavily dependent on the OpenAI model used. The default is gpt-3.5-turbo and these examples have been generated using that. See the usage docs for other options. If the results you get are not good enough with gpt-3.5-turbo then try gpt-4. It is slower and more expensive but may provide higher quality output.

Finding faster replacement programs and libraries

One of the easiest wins in optimisation is replacing an existing library with a functionally equivalent, faster, alternative. Usually this process begins with an engineer looking at the Top-N output of their profiler, which lists the most expensive libraries and functions in their system, and then going on a hunt for an optimised version. The findfaster command solves this problem for you. Here are some examples.
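
As a sketch, assuming findfaster reads the name of the target library or program from stdin (an assumption; check ./sysgrok.py findfaster -h for the real interface):

$ echo zlib | ./sysgrok.py findfaster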

Analysing the Top-N functions and suggesting optimisations

A good starting place for analysing a system is often looking at which libraries and functions your profiler says are using the most CPU. Most commercial profilers have a tab for this information, and you can get it from perf via perf report --stdio --max-stack=0. One of the stumbling blocks when encountering this data is that you first need to understand what each program, library and function actually is, and then you need to come up with ideas for how to optimise them. This is made even more complicated in the world of whole-system, or whole-data-center, profiling, where there are a huge number of programs and libraries running, and you are often unfamiliar with many of them.

The sysgrok topn command helps with this. Provide it with your Top-N, and it will try to summarise what each program, library and function is doing, as well as providing you with some suggestions as to what you might do to optimise your system.
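
For example, you can pipe perf's output straight in (a sketch, assuming topn reads the profile from stdin as described in the Usage section):

$ perf report --stdio --max-stack=0 | ./sysgrok.py topn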

Explaining a specific function and suggesting optimisations

If you know a particular function is using significant CPU then, using the explainfunction command, you can ask for an explanation of that specific function, and for optimisation suggestions, instead of asking about the entire Top-N.
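
As a sketch, assuming the library and function names are passed as arguments (hypothetical; check ./sysgrok.py explainfunction -h for the real interface):

$ ./sysgrok.py explainfunction libz.so deflate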

Executing commands on a remote host and analysing the response using the LLM

The analyzecmd command takes a host and one or more Linux commands to execute. It connects to the host, executes the commands and summarises the results. You can also provide it with an optional description of an issue you are investigating, and the command output will be summarised with respect to that problem.

This first example shows how one or more commands can be executed.

We can also use analyzecmd to analyse commands that produce logs.

And in this final example we execute and analyse the commands recommended by Brendan Gregg in his article "Linux Performance Analysis in 60 seconds".
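
As a sketch, assuming analyzecmd takes the target host followed by the commands to run (the invocation is an assumption; check ./sysgrok.py analyzecmd -h), automating part of that checklist might look like:

$ ./sysgrok.py analyzecmd somehost "uptime" "dmesg | tail" "vmstat 1 5" "free -m"

The individual commands are from Gregg's article; only the sysgrok invocation is hypothetical.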

Automatically debug a problem on a host

The debughost command takes a host and a problem description and then:

  1. Queries the LLM for commands to run that may generate information useful in
    debugging the problem.
  2. Connects to the host via SSH and executes the commands.
  3. Uses the LLM to summarise the output of each command, individually.
  4. Concatenates the summaries and passes them to the LLM to ask for a report on
    the likely source of the problem the user is facing.
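
In rough pseudo-Python, the four steps look like this. This is an illustration, not sysgrok's actual implementation; query_llm is a hypothetical stand-in for the OpenAI/Azure chat call:

import subprocess


def query_llm(prompt):
    # Hypothetical stand-in for sysgrok's OpenAI/Azure chat completion call
    ...


def run_over_ssh(host, cmd):
    # Step 2: execute cmd on host via ssh and capture its output
    result = subprocess.run(["ssh", host, cmd], capture_output=True, text=True)
    return result.stdout + result.stderr


def debughost(host, problem):
    # Step 1: ask the LLM which commands may produce useful debugging information
    commands = query_llm(f"List Linux commands that would help debug: {problem}")
    # Step 2: run each suggested command on the host
    outputs = [run_over_ssh(host, cmd) for cmd in commands]
    # Step 3: summarise each command's output individually
    summaries = [query_llm(f"Summarise with respect to '{problem}':\n{out}") for out in outputs]
    # Step 4: concatenate the summaries and ask for a likely root cause
    return query_llm("Given these summaries, report the likely source of the problem:\n"
                     + "\n".join(summaries))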