CodeQL Guide for a Security Engineer (Part 3 of 6)
March 7th, 2025 by Brian
This is the third post of a six-part blog series where I cover the use of CodeQL and Semgrep in bug-hunting. In the previous installments of the series, I provided examples of custom queries and rules that identify CVEs in open-source libraries. In this post, I will dive deeper into CodeQL, specifically how to write security-focused queries and make them generalizable.
Due to its rich syntax and data flow analysis engine, CodeQL is (in our experience) a more powerful code analysis tool than Semgrep for detecting specific types of security vulnerabilities. Its built-in support for things like type conversions, identification of common function signatures, and bounds checking allows us to write queries for tricky bug cases that can be difficult to detect through manual analysis.
However, CodeQL can be somewhat difficult to learn, especially when first starting out. It has its own extractor for each language that it supports, and the way that you query for code structures can differ quite a bit between programming languages. There are excellent language guides, but without spending a lot of time meticulously combing through them and experimenting with different code cases, it can be difficult to figure out which parts are the most important.
Goal
In this post, I want to offer a guide through some of the most important things I learned about working with CodeQL. I will be primarily working off C/C++ queries, since these are the two most common languages we’re interested in for embedded software. Included are techniques that I found have helped me write generalizable queries that can detect a number of common bug classes (including buffer overflows, out-of-bounds accesses, integer overflows, etc.). The intention of this blog series is to provide helpful concepts that can aid you in the process of writing your own security queries.
I will focus on two primary techniques that will best enhance the generalizability of your queries:
- Writing identifiers for common functions
- Working with pointers + function wrappers
Then, in the following part of this blog series, I will talk about how you should format and organize your queries, provide helpful utility predicates, and give tips on how to work with complex data flow analysis.
Workflow
Reader assumption: This blog assumes you already have some preliminary knowledge of how CodeQL works and some of the basic CodeQL syntax for C/C++. Otherwise, I would recommend getting started with the introduction to CodeQL tutorials and working through the GitHub 2020 workshop. The Trail of Bits guide is also excellent for working with C/C++ syntax.
Targets: In my experimentation process, I mainly targeted open-source libraries such as libcurl, libTIFF, BlueZ, connman, etc. The objective of my study was to build out a database of CodeQL queries that detect common bug types. When optimizing my queries, I focused on specificity (the ability of a query to detect a specific bug with few false positives) and generalizability (the ability of a query to detect other bugs like it within the same and different codebases).
Setup: I primarily used the CodeQL VS Code extension for running and testing my queries, alongside a local copy of the CodeQL CLI. The CLI is necessary to create custom CodeQL databases of targeted codebases via the codeql database create command.
1. Detecting Common Function Signatures
When writing a CodeQL query, one of the main focuses is identifying the data sources and sinks of interest. In many cases, this process will involve some form of function call. For example, when writing queries for memory allocation bugs we might target malloc() and new() calls, while queries for command injections could look for read() calls as data sources and system() or exec() calls as data sinks.
On the surface, searching for function calls may seem like a simple task. However, if your focus is to write a query that can be generalized across many codebases, it can be quite tricky to get right. The query also gets more complex when targeting a number of functions that perform similar behaviors but differ in their signatures, either in the number of arguments or in the ordering of arguments.
Take the following example: When querying for out-of-bounds reads or writes, we are often interested in calls to functions like gets, memcpy, or strcpy. For now, let’s focus on memcpy.
In CodeQL, a common way to target specific function signatures is to search by the function’s name:
Idea 1: Filter by direct function name
import cpp
from FunctionCall call
where call.getTarget().hasName("memcpy")
select call
However, one issue is that functions like memcpy can have multiple variants. It may be referred to as __builtin___memcpy_chk (generated by GCC) or __memcpy (seen in older Linux kernel source). Consider the case of libTIFF, a common library used to parse and manipulate TIFF images. It provides its own implementation, _TIFFmemcpy, which is just a custom wrapper around the original memcpy:
// Inside tif_unix.c
void
_TIFFmemcpy(void* d, const void* s, tmsize_t c)
{
    memcpy(d, s, (size_t) c);
}
Due to this pattern, our original query would not be able to detect calls to _TIFFmemcpy. Instead, we could use regex matching to allow for greater flexibility.
Idea 2: Filter by regex name pattern
from FunctionCall call
where call.getTarget().getName().regexpMatch(".*memcpy.*")
select call
Here we’re searching for any function with “memcpy” in the name using the string regexpMatch predicate. Depending on the use case, it may be necessary to provide a stricter regex. For example, during my testing I created the following pattern, which will also detect variants like wmemcpy, memccpy, and wmemcpy_s:
call.getTarget().getName().regexpMatch("(?i)^\\w*mem(c)?cpy\\w*\\s*$")
In many cases this may be a good enough solution. But what if we also want to include functions that perform effectively the same operation as memcpy but with slightly different arguments or constraints, such as memmove or bcopy?
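To make the gap concrete, here is a small C sketch that approximates the name pattern above with POSIX regular expressions and checks individual function names against it. This is only an approximation under stated assumptions: CodeQL uses Java-style regex syntax, so \w and (?i) are replaced here by an explicit character class and the REG_ICASE flag, and the helper name looks_like_memcpy is made up for illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if `name` matches a POSIX approximation of
 * the pattern (?i)^\w*mem(c)?cpy\w*\s*$, and 0 otherwise (also 0 if the
 * pattern fails to compile).
 */
int looks_like_memcpy(const char *name) {
    /* [A-Za-z0-9_] stands in for \w, [[:space:]] for \s. */
    const char *pattern = "^[A-Za-z0-9_]*memc?cpy[A-Za-z0-9_]*[[:space:]]*$";
    regex_t re;
    int matched;

    /* REG_ICASE stands in for the (?i) flag. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return 0;
    matched = (regexec(&re, name, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

Under these assumptions, names such as memcpy, wmemcpy_s, and _TIFFmemcpy match, while memmove, bcopy, and mempcpy do not, even though they copy memory in much the same way.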
It turns out that this problem isn’t unique to us. Over time CodeQL has provided a set of implementation models for common functions, such as memcpy and malloc, which can greatly simplify our query writing. For memcpy there exists a semmle.code.cpp.models.implementations.Memcpy module which provides the MemcpyFunction class:
Idea 3: Using CodeQL implementation classes
private class MemcpyFunction extends ArrayFunction, DataFlowFunction, SideEffectFunction,
  AliasFunction, NonThrowingFunction
{
  MemcpyFunction() {
    // memcpy(dest, src, num)
    // memmove(dest, src, num)
    // memmove(dest, src, num, remaining)
    this.hasGlobalOrStdOrBslName(["memcpy", "memmove"])
    or
    // bcopy(src, dest, num)
    // mempcpy(dest, src, num)
    // memccpy(dest, src, c, n)
    this.hasGlobalName(["bcopy", mempcpy(), "memccpy", "__builtin___memcpy_chk"])
  }
Here it is able to recognize memcpy-like functions such as memmove and bcopy as part of the definition. In addition, the class also provides helpful predicates for determining the argument index of inputs:
- int getParamDest(): Gets the index of the parameter that is the destination buffer for the copy.
- int getParamSize(): Gets the index of the parameter that is the size of the copy (in bytes).
- int getParamSrc(): Gets the index of the parameter that is the source buffer for the copy.
With this built-in logic, we don’t have to worry about the ordering difference between memcpy(dst, src, len) vs. bcopy(src, dst, len) when fetching the corresponding buffer or length expressions, because that information is abstracted away behind helper predicates. We can simply use it as:
from FunctionCall call, MemcpyFunction memcpy, Expr dst, Expr src, Expr len
where call.getTarget() = memcpy
and call.getArgument(memcpy.getParamDest()) = dst
and call.getArgument(memcpy.getParamSrc()) = src
and call.getArgument(memcpy.getParamSize()) = len
select call, dst, src, len
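For reference, the ordering difference that these predicates hide looks like this on the C side. The sketch below is a hypothetical fixture (the function names are made up), not code from any of the libraries discussed here. It assumes a platform that still ships bcopy, a legacy BSD function declared in <strings.h>.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* lets glibc expose the legacy bcopy() declaration */
#endif
#include <string.h>
#include <strings.h>

/* memcpy takes the destination first, then the source. */
void copy_with_memcpy(char *dst, const char *src, size_t n) {
    memcpy(dst, src, n);
}

/* bcopy takes the source first, then the destination. */
void copy_with_bcopy(char *dst, const char *src, size_t n) {
    bcopy(src, dst, n);
}
```

Both helpers copy the same bytes, but the destination buffer sits at argument index 0 in the memcpy call and at index 1 in the bcopy call, which is the kind of detail a query would otherwise have to special-case per function.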
However, before we can run this query, we observe that the definition of MemcpyFunction is tagged as a private class, which prevents it from being imported outside of the module. I’m actually not entirely sure why this is the case, as the class does not seem to be used anywhere else in the CodeQL codebase.
Nevertheless, for a local instance, we can edit the Memcpy.qll file directly to change the definition from private class to class. For me, the file was located at /home/$USER/.codeql/packages/codeql/cpp-all/2.0.0/semmle/code/cpp/models/implementations/Memcpy.qll.
Alternatively, for a more portable solution, I ended up creating my own “Models.qll” file containing a copy of the MemcpyFunction class among other useful models.
In particular, CodeQL has made available function models for:
- recv: semmle.code.cpp.models.implementations.Recv
- scanf: semmle.code.cpp.models.implementations.Scanf
- memset: semmle.code.cpp.models.implementations.Memset
- and many more.
The usefulness of each default implementation varies; some are more complete than others, but they serve as excellent starting points for developing a query class that targets a specific function or set of functions.
2. Dealing with Global Function Pointers and Wrappers
In the last section, we talked about ways in which you can search for common function calls that may have different names or variants across different codebases. However, one assumption we’ve made is that these calls are of the type FunctionCall, where the name of the target function is known at compile time.
Part 1: Function pointers
In C/C++ and many other languages, though, functions can also be invoked through function pointers, where a variable stores a reference to a function that is then called in the code. Function pointers are also useful for defining callbacks, which may differ depending on the class or implementation type.
For example, in libcurl, a highly popular networking library, if you were to write a query for a FunctionCall that invokes the default malloc function, you would find very few usages, all of which are in the standalone tools that are separate from the main library. However, if you look up the string “malloc(” in the source code, you get a bit more than 200 matches (Note: this is somewhat inflated by the test suites and docs). So what are we missing?
After a little digging, I found that libcurl redefines the keyword “malloc” as a macro in a file called memdebug.h. Depending on the configuration, malloc either expands to a call through the function pointer Curl_cmalloc or, if the CURLDEBUG flag is set, to Curl_dbg_malloc.
// Inside memdebug.h
#ifndef CURLDEBUG
#undef strdup
#define strdup(ptr) Curl_cstrdup(ptr)
#undef malloc
#define malloc(size) Curl_cmalloc(size)
[...]
Going further, we see the definition of Curl_cmalloc as a function pointer to the original malloc:
// Inside easy.c
curl_malloc_callback Curl_cmalloc = (curl_malloc_callback)malloc;
//Inside curl.h
#ifndef CURL_DID_MEMORY_FUNC_TYPEDEFS
/*
* The following typedef's are signatures of malloc, free, realloc, strdup and
* calloc respectively. Function pointers of these types can be passed to the
* curl_global_init_mem() function to set user defined memory management
* callback routines.
*/
typedef void *(*curl_malloc_callback)(size_t size);
Thus, in the libcurl library, the keyword malloc is redefined as Curl_cmalloc (assuming CURLDEBUG is not set), which is a variable that acts as a function pointer to the original malloc.
When it comes to CodeQL, I noticed that any #define macro is automatically resolved for you without adding a level of indirection. So any malloc(...) call in the code is effectively treated the same as Curl_cmalloc(...). Knowing this, we can search for all malloc calls as follows:
from VariableCall call, Variable var
where
call.getVariable() = var
and var.hasName("Curl_cmalloc")
select call, var
But what if you don’t know from the beginning that Curl_cmalloc is a function pointer to the original malloc? Is there a more programmatic way of doing this?
Indeed there is. When I was writing my queries I came up with this utility predicate:
// Checks if variable function pointer points to our function
private predicate variableWrapsFunction(Variable v, Function wrapped) {
  v.isTopLevel()
  and v.getAnAssignedValue().getAChild*() = wrapped.getAnAccess()
}
For a given Function wrapped, the predicate checks whether there is a top-level variable v such that some assignment to v in the code references wrapped. So any trivial definition such as:
curl_malloc_callback Curl_cmalloc = (curl_malloc_callback)malloc;
will be discovered by the predicate.
I added the constraint to only look for top-level variables because for the libraries I’ve been working with, I’ve only seen wrappers for common functions declared at the global level.
You can then use the predicate as follows:
from VariableCall call, Variable var, Function malloc
where
malloc.hasName("malloc") and
var = call.getVariable() and
variableWrapsFunction(var, malloc)
select call, var, malloc
Notice that this query improves upon the first one because you don’t need to specify that Curl_cmalloc or Curl_dbg_malloc wraps around malloc, only that a wrapper for malloc exists somewhere. To be specific, when I say something “wraps” a function, I mean that the function pointer is effectively an alias for the original function, and thus calling the wrapper is effectively the same as calling the original function directly.
Part 2: Function Wrappers
Now we’ve discussed function pointers, but what about functions wrapping around other functions?
This is where things get a little tricky. If you recall from the first section, libTIFF has a tendency to write its own versions of common functions like memcpy, free, memcmp, etc. This is also common in other libraries that want to add things like:
- Additional bounds checking
- Additional arguments (flags, etc.)
- Trivial type conversions (uint32_t to size_t)
- Edge cases
For example, libTIFF redefines malloc, calloc, and free inside tif_unix.c:
void*
_TIFFmalloc(tmsize_t s)
{
    if (s == 0)
        return ((void *) NULL);
    return (malloc((size_t) s));
}
void* _TIFFcalloc(tmsize_t nmemb, tmsize_t siz)
{
    if( nmemb == 0 || siz == 0 )
        return ((void *) NULL);
    return calloc((size_t) nmemb, (size_t)siz);
}
void
_TIFFfree(void* p)
{
    free(p);
}
There are some important distinctions between function pointers and function wrappers:
- Function pointers (when called) are usually of type VariableCall, where the pointer is a direct reference to the original function
- Function wrappers (when called) are usually of type FunctionCall, where the wrapper is a separate function that calls the original function inside its body
In addition, while it’s simple to find a pointer that references a given function, there is no exact science for determining whether a function is intended to be a wrapper around another function or is simply using it to perform some task.
CodeQL does have a FunctionWithWrappers library (semmle.code.cpp.security.FunctionWithWrappers) that is meant to help with this issue. It defines a FunctionWithWrappers class that detects when a function is likely being wrapped. I won’t go too much into the specifics, but one way to use it for memcpy, for example, looks as follows. Notice that we are using the Semmle implementation class MemcpyFunction described in the first section:
import cpp
import semmle.code.cpp.security.FunctionWithWrappers
import semmle.code.cpp.models.implementations.Memcpy
class MemcpyWithWrapper extends FunctionWithWrappers instanceof MemcpyFunction {
  MemcpyWithWrapper() {
    this.getEffectiveNumberOfParameters() = 3
  }
  override predicate interestingArg(int arg) {
    arg in [0..2]
  }
}
class MemcpyWrapper extends Function {
  MemcpyWrapper() {
    exists( MemcpyWithWrapper memcpy |
      forall( int arg_index
        | memcpy.wrapperFunction(this, arg_index, _)
        | arg_index in [0..2]
      )
    )
  }
  predicate getWrapped(MemcpyWithWrapper wrapped, string cause){
    wrapped.wrapperFunction(this, 0, cause)
    or wrapped.wrapperFunction(this, 1, cause)
    or wrapped.wrapperFunction(this, 2, cause)
  }
}
from MemcpyWrapper memcpy, MemcpyWithWrapper wrapped, string cause
where memcpy.getWrapped(wrapped, cause)
select wrapped, memcpy, cause
If you run this on the libTIFF v4.4 codebase, you will see that it detects that _TIFFmemcpy indeed calls memcpy, and it will flag all usages of _TIFFmemcpy. However, one drawback of this module is that it recurses on wrapper calls (up to 4 levels up), meaning it will also report functions that call the wrapper function _TIFFmemcpy, functions that call those functions, and so forth. You can see this in the “cause” string it returns:
_TIFFsetDoubleArray(dp), which calls setByteArray(vp), which calls _TIFFmemcpy(s), which calls memcpy(__src)
TIFFReadTile(buf), which calls TIFFReadEncodedTile(buf), which calls TIFFReadRawTile1(buf), which calls _TIFFmemcpy(d), which calls memcpy(__dest)
which is not always what we want. Seeing this, I decided to write my own predicate for detecting a function wrapper:
Part 2: Function wrapping around other functions
// Checks if function wraps another function
// We make our determination based on a few simple heuristics:
private predicate functionWrapsFunction(Function func, Function wrapped) {
  // Check 1: Same number of parameters
  func.getNumberOfParameters() = wrapped.getNumberOfParameters()
  and exists(FunctionCall wrapped_call |
    wrapped = wrapped_call.getTarget()
    // Check 2: `func` calls `wrapped`
    and func = wrapped_call.getEnclosingFunction()
    // Check 3: All arguments from `func` are fed into `wrapped` in the same order.
    // The index range is the quantifier's domain, so every argument position is checked.
    and forall(int i | i in [0 .. wrapped_call.getNumberOfArguments() - 1] |
      func.getParameter(i).getAnAccess() = wrapped_call.getArgument(i)
    )
    and (
      if func.getType() instanceof VoidType
      then any()
      // Check 4: If `func` has a return type, make sure result of `wrapped` is returned
      else func.getBlock().getAStmt().(ReturnStmt).getExpr().getAChild*() = wrapped_call
    )
  )
  // Check 5: (Heuristic) `func` isn't too large
  and func.getBlock().getNumStmt() < 10 // Loose bound to exclude oversized functions
}
In functionWrapsFunction, I applied a few heuristics to help me determine if a function wraps another function:
- Function calls our wrapped function
- Function has the same number of parameters as our wrapped function
- Function feeds its arguments in the same order into our wrapped function
- If our function returns a value, it must return the result of our wrapped function
- Function isn’t too large (fewer than X statements; we use 10)
The most important piece seems to be the last requirement, where I restrict the number of statements the function wrapper can have. Usually function wrappers add a small amount of logic (if any), and so we can avoid a lot of false positives by requiring their body to be short.
Admittedly, this is still an imperfect solution. On its own, the functionWrapsFunction predicate will give some false positives for wrapped functions. However, when used inside a larger query (where you have constraints on how the function results are used or on the variables in the surrounding context), I’ve found that any falsely flagged wrapper functions tend to be filtered out and ultimately don’t make it into the findings. This allows a more generalized query to find usages of our targeted functions.
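To get a feel for where these heuristics draw the line, a few small C functions like the following could be used as a test fixture. This is a hypothetical sketch (all three function names are made up), and the comments describe what the heuristics as stated would be expected to do, not verified query output.

```c
#include <stddef.h>
#include <string.h>

/* Expected match: same parameter count as memcpy, arguments forwarded in
 * order (with a trivial cast), and a very short body. */
void my_memcpy(void *d, const void *s, long c) {
    memcpy(d, s, (size_t)c);
}

/* Expected non-match: four parameters versus memcpy's three, so the
 * parameter-count heuristic rules it out. */
void copy_at_offset(void *d, const void *s, size_t n, size_t off) {
    memcpy((char *)d + off, s, n);
}

/* Likely false positive: one parameter like strlen, forwards it unchanged,
 * and the strlen call appears inside the returned expression, yet the
 * function does not return the same value as strlen. */
size_t length_plus_one(const char *s) {
    return strlen(s) + 1;
}
```

The last function illustrates the kind of case the heuristics cannot distinguish on their own: it uses strlen to compute something else, which is where the surrounding constraints of a larger query are needed.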
Part 3: Putting it all together
Now, with these two parts, we can put together what we know about function wrappers into one unified class definition:
// Checks if the call expression is essentially a wrapper
// function around our target function
predicate callWrapsFunction(Call call, Function wrapped) {
  (
    call instanceof VariableCall
    and variableWrapsFunction(call.(VariableCall).getVariable(), wrapped) // Function pointer wraps function
  )
  or
  (
    call instanceof FunctionCall
    and functionWrapsFunction(call.(FunctionCall).getTarget(), wrapped) // Normal function wraps function
  )
}
// Representation of a `memcpy` function call
// Could be either direct or wrapped
class MemcpyCall extends Call {
  MemcpyCall() {
    this.getTarget() instanceof MemcpyFunction
    or
    (
      not this.getTarget() instanceof MemcpyFunction // Avoid wrapping known memcpy functions
      and exists( MemcpyFunction memcpy | callWrapsFunction(this, memcpy))
    )
  }
  MemcpyFunction getMemcpyFunc() {
    if this.getTarget() instanceof MemcpyFunction
    // Direct fetch
    then result = this.getTarget()
    // Indirect fetch
    else exists(MemcpyFunction memcpy | callWrapsFunction(this, memcpy) and result = memcpy)
  }
  Expr getDst() {
    result = this.getArgument(this.getMemcpyFunc().getParamDest())
  }
  Expr getSrc() {
    result = this.getArgument(this.getMemcpyFunc().getParamSrc())
  }
  Expr getLen() {
    result = this.getArgument(this.getMemcpyFunc().getParamSize())
  }
}
Here, we define our custom MemcpyCall class as a subclass of Call, the parent class of both FunctionCall and VariableCall. We check whether the call is a direct call to memcpy (based on the MemcpyFunction signature), or a call to a function or function pointer that wraps around memcpy. I also added a few helper predicates so that we can fetch the corresponding arguments without having to care about the exact underlying Call type.
With this class, I no longer have to define a special class for memcpy for every memory/buffer security query I’m writing. Instead, I pull from this one definition, and in my experience it is able to detect most usages of memcpy without me having to write specialized regular expressions or call modifications for each codebase.
Work conducted by Huy Dai.