Python Serialization Vulnerabilities – Pickle

Introduction

Serialization converts the data held in objects into a stream of bytes that can be written to disk or sent over a network. The data can later be deserialized to recreate the original objects. Many programming languages offer a way to do this, including PHP, Java, Ruby and Python (all common backend languages on the web).

Let’s talk about serialization in Python. In Python, when we use the pickle module, serialization is called “pickling.”

Table of contents

  • Serialization in Python
  • Serialization in Web Apps
  • Over Pickling
  • Python YAML vs Python Pickle
  • Mitigation
  • Demonstration
  • Conclusion

Serialization in Python

In Python, pickle.dumps() is used to serialize data and pickle.loads() is used to deserialize it (pickling and unpickling). For example, here is a list, pickled.

python3
>>> import pickle
>>> variable = pickle.dumps([1,2,3])
>>> print(variable)
b'\x80\x04\x95\x0b\x00\x00\x00\x00\x00\x00\x00]\x94(K\x01K\x02K\x03e.'
>>> pickle.loads(variable)
[1, 2, 3]
>>>

As we can see above, when we print the variable, we see a byte string. This is the serialized form of the list. Later, with pickle.loads(variable), we deserialize it back into the original object.

This is helpful in many cases, including when we want to save some variables from a program to disk as a binary file that other programs can use later. For example, let’s create a list and save it as a binary file.

import pickle
variable = pickle.dumps([1,2,3])
# Write the pickled bytes to a binary file
with open("myarray.pkl","wb") as f:
  f.write(variable)

As we can see, a pickle binary is now stored on the drive. Let’s read it using pickle again.

import pickle
# Read the pickled bytes back and deserialize them
with open("myarray.pkl","rb") as f:
  data = f.read()
obj = pickle.loads(data)
print(obj)

As you can see, the object returned by pickle.loads() can be used just like the original list again! During development, there may come a time when a developer wants to quit the IDE and save the data and the state of all variables at that moment. That is where this feature is helpful.
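As a rough sketch of that "save the state and restore it later" idea, the snippet below pickles a small dictionary straight to a file with pickle.dump() and reads it back with pickle.load(). The state dictionary and the file name state_example.pkl are made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical program state that should survive a restart.
state = {"counter": 42, "names": ["alice", "bob"], "ratio": 0.75}

# Illustrative file location inside the system's temporary directory.
path = os.path.join(tempfile.gettempdir(), "state_example.pkl")

# pickle.dump() serializes the object directly into an open binary file.
with open(path, "wb") as f:
    pickle.dump(state, f)

# pickle.load() reads the object back from the file.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == state)

# Clean up the example file.
os.remove(path)
```

This is only safe because the file was written by the same program; the same pickle.load() call on a file from an untrusted source is exactly the problem discussed in the rest of this article.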

Serialization in Web Apps

Okay, so we have talked about serialization in software applications. But what is the use of serialization in web apps? HTTP is a stateless protocol. That is, one request doesn’t depend on the state of the previous request. But sometimes there is a need to maintain state. That’s why we have cookies. Cookies bring a sense of statefulness to HTTP.

If we want a user’s information and some data to be retained the next time they interact with the server, serialization is a wonderful fit. Just serialize the data, put it into a cookie (which takes up the user’s storage and not the server’s! Wow) and on the next request deserialize it and use it on the site.

Pickle is sometimes used in Python web apps to do this. The caveat is that pickle deserializes unsafely, and the content of a cookie is controlled by the client. Serialization in JSON is much safer: unlike pickle, JSON is a data-only format and doesn’t allow executable code to be embedded within the data, which removes this particular code injection risk.

It is possible to construct malicious pickle data which will execute arbitrary code!
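One way to keep state in a cookie without pickle is to store JSON and sign it, so the server can detect tampering. The following is a minimal sketch of that idea using only the standard library. The names SECRET_KEY, encode_cookie and decode_cookie are hypothetical, and a real application would normally rely on its framework's signed-session support instead.

```python
import base64
import hashlib
import hmac
import json

# Assumption: in a real application this key comes from server-side configuration.
SECRET_KEY = b"change-me"

def encode_cookie(data: dict) -> str:
    """Serialize data as JSON and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(data).encode("utf-8"))
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature

def decode_cookie(cookie: str) -> dict:
    """Verify the signature first, then parse the JSON payload."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Compare as bytes in constant time to avoid leaking timing information.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        raise ValueError("cookie signature mismatch")
    return json.loads(base64.urlsafe_b64decode(payload))

cookie = encode_cookie({"user": "alice", "theme": "dark"})
print(decode_cookie(cookie))
```

Even if an attacker rewrites the payload, the worst outcome of parsing it is plain data (dictionaries, lists, strings, numbers), and the signature check rejects the modified cookie before it is used.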

Over Pickling

We have talked about pickling well-known data types like a list. But what if we were to pickle our own custom classes? Python can easily understand and deserialize built-in types, but what will it do with custom classes such as connections to servers and all those fancy networking scripts? It doesn’t even make sense to serialize those, but the Python developers added a way to pickle them too. Discrepancies can arise when Python tries to deserialize such objects.

Pickle supports custom pickling and unpickling code. When you define a class, you can provide a mechanism that states, ‘Here is what you should do when someone asks to unpickle you!’ So when Python goes to unpickle this string of bytes, it might have to call some function to figure out how to properly reconstruct that object. The pickle file names that function and its arguments, and the unpickler calls it.
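Before the malicious example, here is a harmless sketch of the same mechanism. The class name Harmless is made up for illustration. Its __reduce__ method tells pickle to call the builtin max() with three numbers, so unpickling returns the result of that call rather than an instance of the class.

```python
import pickle

class Harmless:
    def __reduce__(self):
        # (callable, args): the unpickler will call max(3, 7, 5)
        return (max, (3, 7, 5))

data = pickle.dumps(Harmless())

# Unpickling does not rebuild a Harmless instance; it calls the
# callable named in the pickle and returns whatever that call returns.
restored = pickle.loads(data)
print(restored)        # 7
print(type(restored))  # <class 'int'>
```

The unpickler has no idea whether the callable is harmless. Whatever callable and arguments the pickle names are invoked during pickle.loads().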

Let’s see a small example.

Here is a proof-of-concept. This code creates a class called EvilPickle. To implement support for pickling on your custom object, you define a method called “__reduce__” which returns a callable and a tuple of arguments to call it with. Here, a simple “cat /etc/passwd” would be run using the os.system function. Finally, the pickled object is written to a binary file called backup.data.

python
import pickle
import os
class EvilPickle(object):
  def __reduce__(self):
    return (os.system, ('cat /etc/passwd', ))
pickle_data = pickle.dumps(EvilPickle())
with open("backup.data", "wb") as file:
  file.write(pickle_data)

The idea here is to make whoever deserializes the file run cat /etc/passwd on their system. Let’s try it out now! We save the above code in the evilpickle.py file and run it. Just to check, we’ll cat the backup.data file. Here we can clearly see something fishy!

The user deserializes it anyway and ends up printing out the /etc/passwd file.

python
import pickle
pickle.loads(open("backup.data","rb").read())

We can get even more nerdy and see what is happening under the hood by disassembling the file with pickletools. Because the pickle was created on a Unix-like OS, os.system is recorded under its real module name, posix. The strings posix and system are pushed with SHORT_BINUNICODE opcodes and saved in the memo (the “as 0”, “as 1” and so on in the output), and they are then resolved to the function posix.system. The argument, cat /etc/passwd, is wrapped in a tuple by a TUPLE opcode. The `REDUCE` opcode calls the callable (typically a Python function or method, here os.system) with that tuple of arguments. Finally, STOP ends the pickle.

The arguments are passed as a tuple rather than a list. The primary difference between the two is that tuples are immutable while lists are mutable: a list can be changed after it is created, but a tuple cannot.

python3 -m pickletools -a backup.data

Note: the -a option annotates each opcode with a short description.
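The same opcode information is available from Python code. The sketch below uses pickletools.genops(), which parses a pickle without executing it, to list opcodes that import names or call callables. The SUSPICIOUS set and the suspicious_opcodes helper are assumptions made for illustration. This is a rough heuristic for inspection, not a security control.

```python
import pickle
import pickletools

# Opcodes that import names or call callables during unpickling.
# Assumption: this list is illustrative, not an exhaustive blocklist.
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"}

def suspicious_opcodes(data: bytes) -> set:
    """Return the suspicious opcode names found in a pickle, without loading it."""
    found = {opcode.name for opcode, arg, pos in pickletools.genops(data)}
    return found & SUSPICIOUS

class Harmless:
    def __reduce__(self):
        # Benign stand-in for a malicious payload: asks pickle to call max(3, 7, 5)
        return (max, (3, 7, 5))

# A plain list needs no imports or calls to be rebuilt.
print(suspicious_opcodes(pickle.dumps([1, 2, 3])))

# An object with __reduce__ needs a global lookup and a REDUCE call.
print(suspicious_opcodes(pickle.dumps(Harmless())))
```

Many legitimate pickles (any custom class instance, for example) also contain these opcodes, so a hit only means the pickle would import or call something when loaded.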

Since the pickle object is user-controlled and is unpickled on the server, we can even use this to get a shell on the remote server (by pickling a payload that opens a socket back to us and providing it to the server).

PyTorch has long used pickle to serialize ML models, so loading an untrusted model file could lead to arbitrary code execution. The Safetensors format was created to overcome this issue.

Python YAML vs Python Pickle

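Python YAML (PyYAML) is another way to serialize data instead of pickle. But Python YAML can also execute arbitrary code when a document is loaded with an unsafe loader. Older PyYAML releases did this with a plain yaml.load() call, while recent releases require the loader to be passed explicitly. Here is another POC:

import yaml
document = "!!python/object/apply:os.system ['cat /etc/passwd']"
# Recent PyYAML versions need the unsafe loader to be requested explicitly
yaml.load(document, Loader=yaml.UnsafeLoader)

This would also execute cat /etc/passwd. We can avoid this by using “safe_load()” instead of load()!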

Mitigation

Pickle is just one module in Python. It is a very well-known tool and developers still use it, but mindful developers will not ignore the warning on pickle’s documentation page, which states that the pickle module is not secure and that you should only unpickle data you trust.

Alternatives to pickle and brief POCs on them are as follows:

JSON

import json
# Serialize
data = {"key": "value"}
json_data = json.dumps(data)


# Deserialize
deserialized_data = json.loads(json_data)

msgpack

import msgpack
# Serialize
data = {"key": "value"}
msgpack_data = msgpack.packb(data)


# Deserialize
deserialized_data = msgpack.unpackb(msgpack_data, raw=False)

Other safe options include Google’s Protocol Buffers (protobuf) and CBOR.
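If pickle cannot be replaced, the pickle documentation describes restricting which globals may be loaded by overriding Unpickler.find_class(). The sketch below follows that pattern. The allowlist SAFE_BUILTINS and the helper name restricted_loads are choices made for this example, and such a filter reduces the risk rather than making untrusted pickles safe.

```python
import builtins
import io
import pickle

# Assumption: an illustrative allowlist; a real one depends on the application.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks for a global such as os.system.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data: bytes):
    """Like pickle.loads(), but refuses globals outside the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

class Harmless:
    def __reduce__(self):
        # Benign stand-in for a payload: asks the unpickler to call max(3, 7, 5)
        return (max, (3, 7, 5))

# Plain data and allowlisted builtins still load.
print(restricted_loads(pickle.dumps([1, 2, range(3)])))

# Anything else is rejected before the callable is ever invoked.
try:
    restricted_loads(pickle.dumps(Harmless()))
except pickle.UnpicklingError as error:
    print("blocked:", error)
```

The rejection happens at lookup time, so the callable named in the pickle is never reached.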

Demonstration

Okay, so the target website is a note-taking app that uses serialization. Here is what happens when I submit a note with a PNG image.

This is how the note looks once the server has processed it. Observe that the URL is rendering a .pickle file.

The challenge also provided us with the app.py source code, which tells us all about the background logic. I can’t post the entire code but here are some relevant snippets.

As we can see, the code accepts the title, content and image as an object, pickles it and stores it in title.pickle.

Here are the key functions of the code:

  1. The Note() class creates an object, new_note, with 3 items: title, content, image_filename.
  2. save_note() calls pickle.dumps() to pickle new_note. The image is stored using image.save(), a method that Flask (via Werkzeug) provides on uploaded files. Similarly, image.filename gives the image’s filename.
  3. The secure_filename() function converts insecure names to secure ones. For example: note 1 becomes note_1, ../../../etc/passwd becomes etc_passwd.
  4. unpickle_file loads the pickled file provided to it and unpickles it.

Here are some key takeaways about the functionality of the code:

  1. The site accepts 3 key items.
  2. It does not check whether the uploaded file is a valid PNG. This is a good attack point.
  3. All in all, the PNG file upload is a really strong candidate for carrying our code because: (a) the site isn’t validating the PNG, and (b) it will unpickle any file we provide.
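The missing check from the second takeaway can be sketched as follows. Every PNG file starts with the same fixed 8-byte signature, so a server can at least reject uploads that do not. The helper name looks_like_png is made up for this sketch and is not taken from the challenge's app.py. A signature check alone does not make unpickling uploaded files safe.

```python
import pickle

# The fixed 8-byte signature that begins every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data: bytes) -> bool:
    """Return True if the data starts with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)

# A pickle renamed to evil.png still starts with pickle opcodes, not the PNG signature.
fake_png = pickle.dumps({"title": "note", "content": "not an image"})
print(looks_like_png(fake_png))

# Data that begins with the signature passes this (shallow) check.
print(looks_like_png(PNG_SIGNATURE + b"...rest of the file..."))
```

The more important fix is on the other side: the server should never pass a user-supplied file to pickle.loads() in the first place.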

I first tried a simple cat /etc/passwd command on my local machine, and the pickled evil.png file deserialized properly!

import pickle
import os
class EvilPickle(object):
  def __reduce__(self):
    return (os.system, ('cat /etc/passwd', ))
pickle_data = pickle.dumps(EvilPickle())
with open("evil.png", "wb") as file:
  file.write(pickle_data)

Let’s take it a step further: we start a netcat listener and have the local deserialization of evil.png connect back to it and give us a shell!

By following the same logic, we can exploit the server. First, I create the malicious PNG file and upload it to the server.

The uploaded data becomes a pickle file which is stored on the server. When it is requested, it is unpickled and its data is shown on the screen.

Finally, we access the uploaded PNG file on the server.

This way, we get a reverse shell on the netcat listener we set up!

This is how we root the box! Please note that I hid and altered a few details throughout the CTF section of the article because the CTF is still an ongoing challenge and I couldn’t obtain permission to post a complete solution.

Conclusion

Serialization vulnerabilities are easy to exploit and easy for developers to overlook. They can even lead to arbitrary code execution on a machine. As we saw, when we deserialize data insecurely or use insecure functions, we put our infrastructure at risk of compromise. Developers should carefully read the documentation and not ignore its warnings. Finally, use a format like JSON to serialize and deserialize data; it is a data-only format and can’t contain executable code. Thanks for reading.

Author: Harshit Rajpal is an InfoSec researcher and left and right-brain thinker.