tl;dr

I’ve been comparing crates on crates.io against their upstream repositories in an effort to detect (and, ultimately, help prevent) supply chain attacks like the xz backdoor [1], where the code published in a package doesn’t match the code in its repository.

The results of these comparisons for the most popular 999 [2] crates by download count are now available. These come with a bunch of caveats that I’ll get into below, but I hope it’s a useful starting point for discussing code provenance in the Rust ecosystem.

No evidence of malicious activity was detected as part of this work, and approximately 83% of the current versions of these popular crates match their upstream repositories exactly.

Background

I’ve been employed by the Rust Foundation [3] to work on security matters for a bit over a year now. My focus has mostly been on the crate ecosystem thus far, and especially around supply chain security.

After the xz backdoor [1], one question that immediately came up was “could a crate be compromised in the same way?”. Perhaps an even more urgent question was “is a crate compromised in the same way?”. We need tooling to be able to answer those questions in an ongoing way.

Analysis

I’ve built a standalone tool to analyse crate versions, which I intend to eventually turn into something that performs ongoing analysis of newly published crate files. (Once that’s ready, this will be open sourced — the current tool relies on personal tooling I’ve built to mirror crates.io, which is (a) ugly as hell, and (b) not useful to anyone except me in its current form. The methodology is described below, though.)

For now, I’ve analysed the top 999 [2] crates on crates.io.

As part of that, I also built a rough and ready tool to visualise the results for spot checking purposes, which I’ve now turned into a static site generator and have used to publish the results of that analysis. [4]

Methodology

The short version here is:

  1. Take every version of each crate.
  2. See if the manifest defines a repository.
  3. See if the crate file includes VCS metadata.
  4. Clone the repository at the given revision.
  5. Run cargo package to rebuild the crate file. [5]
  6. See if it matches!
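
To make the last step concrete, here is a minimal sketch of one way to compare a published crate file against a rebuilt one. This is not the tool's actual code: the function names are made up for illustration, and it assumes that comparing the contents of each packaged file is more useful than comparing the two compressed archives byte for byte.

```python
import hashlib
import io
import tarfile


def file_digests(crate_bytes: bytes) -> dict:
    """Map each regular file in a crate file (a gzipped tarball) to its SHA-256."""
    digests = {}
    with tarfile.open(fileobj=io.BytesIO(crate_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile():
                data = archive.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


def compare_crates(published: bytes, rebuilt: bytes) -> dict:
    """Report files that differ between the published and the rebuilt crate file."""
    pub, reb = file_digests(published), file_digests(rebuilt)
    return {
        "only_in_published": sorted(pub.keys() - reb.keys()),
        "only_in_rebuilt": sorted(reb.keys() - pub.keys()),
        "changed": sorted(n for n in pub.keys() & reb.keys() if pub[n] != reb[n]),
    }
```

In this sketch, a crate version would count as matching when all three lists come back empty.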

Simple, right? Right?

Well, as it turns out, there are a few issues.

Issues

There are, in fact, a bunch of ways the above can fail.

Symlinks on Windows

This isn’t an issue for the vast majority of crates, but I spent a fair bit of time tracking this down.

Basically, if you have a symlink in your repo, and you clone it on Windows without symlink support enabled [6], Git will helpfully turn each symlink into a regular file. That file’s contents will be the path the symlink points to, rather than the contents of the file at that path.

Now, if it’s a source file, you’re probably going to notice right away (since your builds will fail), but for things like README files and licences, you probably won’t. And, to make matters worse, they’re pretty much the most commonly symlinked files, particularly in multi-crate workspaces.

For now, I’ve elected to give crate versions that otherwise match their repos a special yellow sort-of-OK state. Still, it’s not lost on me that this might be a potential vector of attack in the future. Realistically, the fix here is probably to encourage crate developers to publish their crates out of (non-Windows) CI. (More on that in a bit.)
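
As an illustration of what such a check could look like, here is a small sketch that classifies a published file whose counterpart in the repository is a symlink. It is a guess at a workable heuristic rather than the one the tool uses. The function name, the `repo_files` mapping (repository-relative paths of regular files to their contents), and the tolerance for surrounding whitespace are all assumptions.

```python
import posixpath


def classify_symlinked_file(published: bytes, link_path: str,
                            link_target: str, repo_files: dict) -> str:
    """Classify a published file whose repository counterpart is a symlink."""
    # Resolve the symlink target relative to the directory holding the link.
    resolved = posixpath.normpath(
        posixpath.join(posixpath.dirname(link_path), link_target)
    )
    if published == repo_files.get(resolved):
        return "match"  # the symlink was followed and the real contents packaged
    if published.strip() == link_target.encode("utf-8"):
        return "flattened"  # the file only holds the target path as text
    return "mismatch"
```

For example, a `crates/foo/LICENSE` symlink pointing at `../../LICENSE` would be reported as "flattened" if the published file contains just the text `../../LICENSE`.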

Stuff straight up not existing

Just because you declare a repository in a Cargo manifest, doesn’t mean that the repo still exists. (Or, indeed, ever existed.) And that’s before even getting to submodules. Or revisions — just because Cargo saw a revision in a local repository doesn’t mean it ever got pushed to a public code host. It just has to be committed locally to avoid needing to use --allow-dirty.

Speaking of…

Dirty crates

If you publish with something like cargo publish --allow-dirty, then that lovely VCS info file doesn’t get included in the crate file.

This is probably the right choice on the Cargo side, but I do feel that we’ll lose the ability to otherwise verify repos in some cases where someone just has an extra set of test cases in their working directory and used --allow-dirty to get around the requirement that the Git tree is clean.
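
For reference, the VCS info file is `.cargo_vcs_info.json`, which sits next to the packaged manifest inside the crate file and records the Git revision under `git.sha1`. The sketch below shows one way to pull it out of a crate file and to tell "present" from "absent". The function name is hypothetical, and the assumption that everything lives under a single `<name>-<version>/` directory reflects how crate files are normally laid out.

```python
import io
import json
import tarfile


def read_vcs_info(crate_bytes: bytes):
    """Return the parsed .cargo_vcs_info.json from a crate file, or None if absent."""
    with tarfile.open(fileobj=io.BytesIO(crate_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            # Expect "<name>-<version>/.cargo_vcs_info.json" at the top level only.
            parts = member.name.split("/")
            if (member.isfile() and len(parts) == 2
                    and parts[1] == ".cargo_vcs_info.json"):
                return json.load(archive.extractfile(member))
    return None
```

A `None` result here corresponds to the crate versions that cannot be tied back to a revision at all.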

Build failures

This isn’t really an issue in the top crates, but in testing the deeper reaches of the crate ecosystem, some crate versions just straight up don’t build any more — presumably because they relied on submodules that no longer populate correctly, or because they relied on nightly features that no longer exist.

It’s hard to generate a crate file to test when the crate doesn’t build.

Workspaces, again

I mentioned workspaces earlier, but another problematic case is users of workspaces on older versions of Cargo. Before Cargo 1.57 (December 2021), crate files built from subdirectories of repositories didn’t have that fact annotated in their VCS info.

In theory, it would be possible to search the repository to try to discover where a member crate is built from. Alas, that exercise has been left for another day, so those crates will currently show up as not being found in the repository.
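
For what it's worth, a first cut at that search might look something like the sketch below. This is purely illustrative and has not been run against real repositories: the function names are invented, the manifest parsing is a deliberately naive line scan rather than a real TOML parser, and skipping `.git` and `target` directories is an assumption about what's safe to ignore.

```python
import os
import re

_TABLE = re.compile(r"^\s*\[+\s*([^\]]+?)\s*\]+\s*(#.*)?$")
_NAME = re.compile(r"""^\s*name\s*=\s*["']([^"']+)["']""")


def package_name(manifest_text: str):
    """Naively pull [package].name out of a Cargo.toml without a TOML parser."""
    table = None
    for line in manifest_text.splitlines():
        header = _TABLE.match(line)
        if header:
            table = header.group(1)  # remember which table we are inside
            continue
        if table == "package":
            found = _NAME.match(line)
            if found:
                return found.group(1)
    return None


def find_crate_dirs(repo_root: str, crate_name: str) -> list:
    """Return repo-relative directories whose Cargo.toml declares crate_name."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Prune directories that cannot contain a workspace member.
        dirnames[:] = [d for d in dirnames if d not in {".git", "target"}]
        if "Cargo.toml" in filenames:
            manifest = os.path.join(dirpath, "Cargo.toml")
            with open(manifest, encoding="utf-8") as f:
                if package_name(f.read()) == crate_name:
                    matches.append(os.path.relpath(dirpath, repo_root))
    return sorted(matches)
```

More than one match (or none) would still need a human to look at it, which is part of why this is harder than it first sounds.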

Very old things

And, finally, Cargo added support for generating the VCS info file in version 1.30, which was released in October 2018. Very old crate versions simply won’t have this file, and hence can never be verified.

Results

Given the above, here’s what I found. Of the most recent versions of the top 999 crates:

  • 826 crates match their upstream repositories at the revision they were built at.
  • 74 crates have revisions that cannot be found in their repositories, whether due to later squash merges, rebases or revisions simply not being pushed.
  • 73 crates do not have VCS info, either because they were built with old Cargo versions, built with --allow-dirty, or not built from a repo clone at all. [7]
  • 7 crates do not declare a repository in their Cargo manifest.
  • 7 crates would match their upstream repository but for one or more symlinks being incorrectly handled.
  • 3 crates declare repositories that do not exist.
  • 3 crates have submodules that do not exist.
  • 3 crates cannot be found within their repositories.
  • 3 crates cannot be built due to cargo package errors.
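
As a quick sanity check on the numbers above, the categories add up to 999, and the exact-match share works out to the roughly 83% quoted at the top. The dictionary keys below are shorthand labels of my own, not names used by the tool.

```python
results = {
    "match": 826,
    "revision not found": 74,
    "no VCS info": 73,
    "no repository declared": 7,
    "symlink mismatch only": 7,
    "repository missing": 3,
    "submodule missing": 3,
    "crate not found in repository": 3,
    "cargo package failed": 3,
}

total = sum(results.values())
match_share = results["match"] / total

print(total)                 # 999
print(f"{match_share:.1%}")  # 82.7%
```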

Going back further, those 999 crates have published 33,085 versions in total. The major trends looking back further into history are that fewer crates have repository metadata, and there are more errors related to not being able to find a crate in a workspace and more missing repositories. Both of these feel intuitively correct: the further back into history we go with these crates, the more likely it is that they were packaged with older versions of Cargo, and the more likely it is that their repository history has shifted in ways that we can’t unpick in 2024.

Only 8 crate versions straight up don’t match their upstream repositories. None of these were malicious: seven were updates from vendored upstreams (such as wrapped C libraries) that weren’t represented in their repository at the point the crate version was published, and the last was the inadvertent inclusion of .github files that hadn’t yet been pushed to the GitHub repository.

Future work

An obvious next step here is to extend this to the entire crates.io corpus. I intend to perform this analysis in the next couple of weeks.

Rather than further extending the static site that I’ve published today, I would also like to integrate this into crates.io for every crate, and run this check each time a new crate version is published. Doing this will, of course, require the consensus of the crates.io team, and work to design the UI and UX for this in a way that is immediately useful to the casual Rust user.

I think it’s also critical that we start providing off-the-shelf GitHub Actions (and equivalents for other popular code hosts) that make it easier to publish directly out of repositories on the host, rather than crates being published from developer desktops. This is also a critical step on the road towards supporting a full trusted publishing pipeline.

And, of course, there’s plenty that can be done to improve the analysis.

The handling of broken symlinks is a late-added heuristic that I’m still not 100% sure I like.

Discovering crates within workspaces published from old Cargo versions would improve the accuracy of the checks.

Finally, getting a better idea of what types of changes exist is also important: analysing the top 999 crates didn’t really result in enough crate versions that didn’t match their upstreams to perform any real analysis, but a larger dataset will likely give us a lot to dig into. This is important because it will allow us to develop tailored best practice advice for different real world scenarios.

In summary

If there’s a backdoor attack lurking in the crates ecosystem, then it’s lurking pretty deep at present. The popular crates that we all rely on day to day generally appear to be what they say they are.

Of course, just because a package is the same as its upstream repository, that doesn’t mean that the repository itself is safe. This just mitigates one potential area of supply chain interest. (Alas, there are no silver bullets.)

I’m looking forward to developing this work further in conjunction with the Rust project, Rust Foundation, crates.io team, and others, but also ensuring that we broaden the analysis and scanning work that we do as we go.

Lots to be done!

  1. I doubt this needs any introduction, but Wikipedia has a decent summary.

  2. Why 999? There are just under 150k crates on crates.io, but as with most package registries, their popularity follows a pretty standard log curve where the most popular crates are really popular, and there’s a very long tail of crates that are rarely or never used by other projects.

    As to why it’s 999 and not 1000, you can decide if it’s because 999 is a more memorable number, because it allowed for the joke in the title, or because I had an off-by-one error in the shell pipeline that generated the candidate list.

    Finally, “top” is being defined by download count in the last 90 days for now.

  3. My thanks to the Rust Foundation for supporting this work in conjunction with the OpenSSF’s Alpha Omega project. My thanks also to the crates.io team, who continue to support and encourage security-related experimentation. 

  4. Two notes on this:

    1. I am not a frontend developer. I used to be many years ago, but anyone who’s worked with me in the last decade will agree that I shouldn’t be allowed near any frontend technology. So go easy!

    2. The overall bundle and data size is kind of large — it all weighs in at just under 1 MB to render the index, and then there’s another ~180 MB of fine-grained data that gets loaded on a per-crate basis. Don’t load this on a heavily metered mobile connection.

  5. cargo package --list isn’t usable because it’s currently ambiguous where files come from in workspace scenarios. cargo issue #11666 would add a JSON output mode that would make actually packaging unnecessary, so I’m crossing my fingers that that can be landed at some point.

    (And no, I’m not ruling out going and implementing it myself at some point.) 

  7. It would probably be possible to figure out a heuristic to split these, but I don’t think the answer is particularly interesting, honestly: the only one of these cases that is really actionable is the --allow-dirty case, and we can address that through better tooling to publish crates from CI.