`cudaPackages` Feb 13th, 2025
===
Time: https://www.timeanddate.com/worldclock/fixedtime.html?msg=cudaPackagesNG&iso=20250213T14&am=50
Location: https://matrix.to/#/#cuda:nixos.org
Agenda
---
- [x] Optional: introductions, updates
- [x] Describe the problematic `{ cudaSupport, ... }: if cudaSupport then ... else ...` pattern
- [ ] Any damage beyond conditional `stdenv`?
- [x] Does the pattern reoccur with the nvcc-wrapper approach?
- It does not
- [ ] `stdenv.cudaSupport` - does it make sense?
- [ ] `crossSystem.cudaSupport`/`hostPlatform.cudaSupport` - does it make sense?
- [ ] Mass rebuild costs?
- [x] `ConnorBaker/cuda-packages`
- [ ] Any downsides to the nvcc-wrapper approach?
- [x] What is the LTO issue?
- Object bitstreams produced by different compiler versions can't be LTO'd together
- [ ] LTO aside, do linkers even officially support object files produced by different compilers?
- [ ] Any new insights on in-tree v. out-of-tree development?
- [x] Encourage NixOS Foundation to seek exemption from NVIDIA's license clause barring redistribution of patched executables
- Also spam nvidia employees with https://github.com/NVIDIA/build-system-archive-import-examples/issues/10
- Should also seek an exemption for nix-community
- [x] Investment in community-owned/operated build infrastructure: include CUDA packages by default/make GPU builders available
- [ ] Packaging of `libcusparselt` (from @GaetanLepage)
- Currently blocking the [pytorch update](https://github.com/NixOS/nixpkgs/pull/377785#issuecomment-2624721749)
- [ ] Optional: backlog. Attendees walk through the CUDA Team's backlog, tickets are explained by the members familiar with the ticket's context, tickets are sorted appropriately (e.g. when out-dated or already resolved)
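For context, the `cudaSupport`-conditional `stdenv` pattern from the agenda looks roughly like this (a hypothetical downstream package; the derivation and its inputs are illustrative, not an actual Nixpkgs package):

```nix
# Hypothetical consumer illustrating the anti-pattern: every package
# must remember to select the CUDA-compatible stdenv itself.
{ lib, stdenv, config, cudaPackages, cudaSupport ? config.cudaSupport }:

let
  # Easy to forget: NVCC only supports specific host compilers, so the
  # build silently uses the wrong toolchain if this line is omitted.
  effectiveStdenv = if cudaSupport then cudaPackages.backendStdenv else stdenv;
in
effectiveStdenv.mkDerivation {
  pname = "example";
  version = "0.1";
  buildInputs = lib.optionals cudaSupport [ cudaPackages.cuda_cudart ];
}
```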
Meeting notes
---
- 14:02 The meeting begins. Introductions. Five attendees: SomeoneSerge, connor, GaetanLepage, ruro, yorik.sar
- 14:07: Updates
- Connor: finalized most of the out-of-tree work; before the release: finalize documentation, diff against upstream, plan a way to merge things upstream incrementally to land it in 25.05
- SomeoneSerge: started working as a self-employed person
- very interested in supporting Nixpkgs/CUDA ecosystem for HPC and have several customers that shall not yet be named who are interested in improving the stability of the ecosystem
- will try within the coming month to re-write the way CUDA packages use `evalModules` so that `evalModules` is called only once rather than once per CUDA package set
- also planning to make sure the codegen part is optional
- will require coordination with Connor with the out of tree CUDA packaging
- Gaetan comments: the state of the Python packages isn't too bad, although we have migrated jaxlib/tf to wheels; these are not CUDA issues but Bazel issues; zeuner is working on tf (SomeoneSerge: but what is the status?); pytorch: new release, unpackaged `libcusparselt`
- SomeoneSerge: `fetchSubmodules = false;` to remove the need for vendored dependencies; Bazel: split out openxla (at least)
- connor: source builds are very important: customization, targeting specific devices
- 14:18
- Serge: it is backwards that we delegate the responsibility of switching between `stdenv`s to the downstream consumer; it is very easy to forget to use the right `stdenv` conditioned on `cudaSupport`
- Also possibly inconsistent that we keep the `cudaSupport` attribute in `config`; proposal to move it to the `*Platform` attributes, which would allow us to create `pkgsCuda`; alternatively, use a `stdenv` with a `cudaSupport` attribute, but that might cause rebuilds because we'd be using a different `stdenv`; on replacing `cudaStdenv` with NVCC hooks
- connor: `effectiveStdenv = if cudaSupport then ... else ...` is a major issue for out-of-tree/downstream consumers
- nvcc wrapper that uses "nvcc-gcc"
- setup-hook to test for leaking references (retaining links to wrong libraries)
- LTO: error at link time, "bitstream versions are different"; presumably, incompatible object files generated by different (major) versions of compilers
- OK for current versions of nvcc, gcc
- `pkgsCuda = pkgs.extend overlay` ("unprincipled", but works)
- Gaetan/connor cudaCapabilities list v. attrset
- ruro: how would this work with the nix-community hydra builder? is nix-community not willing to spend compute on building individual capabilities?
- connor: my `pkgsCuda` doesn't affect the default attrset; ...; availability of CI to build things; ideally would be able to fetch exactly what's needed, e.g. for closure sizes; would like to (ab)use separable compilation; alt.: build everything and strip out irrelevant architectures
- gaetan: which infra builds for nix-community cache?
- connor: nix-community hydra
- gaetan: with this change (connorbaker/cuda-packages) would nix-community need to build for each architecture separately?
- connor: no, the change only affects `pkgsCuda`, not the top-level `pkgs`
- someoneserge:
- ...
- Serge, commenting on nix-community hydra not being willing to spend time building for individual GPU capabilities: there hasn't been an official statement from the community; we chose to build with the defaults because that was the simplest way to start chasing down build failures, and it's what most consumers get out of the box unless they know how to configure things; there's no reason we couldn't build for individual GPU capabilities, but there are concerns that spawning too many instances of CUDA packages would starve everything else of resources, though it should be possible to put them in a separate queue; we should advertise to customers that we have these packages cached
- Serge: one of the recurrent subjects is build/CI infrastructure; customers don't care about the binary cache at all, but they do want CI which catches build failures or notifies them of breaking changes, because even if they customize and rebuild everything, they still need a stable base layer; there is interest in having GPU tests available and running on specific hardware; as Connor said, we need community-owned infrastructure to ensure CI is usable long-term; it makes sense to go forward with nix-community, which rents from Hetzner; it would be better if it supported spot instances; we could also buy hardware through a nonprofit and colocate it; talked with Matthew Croughan, who modestly said he could easily afford 4 kW and scale to 10 kW at the hackerspace he's setting up in Liverpool; it's fairly easy at this point to build a CI which takes revisions from master or a rolling release branch and tells us what is broken, but it's much more important to build a CI which operates on push; that's tricky because it requires integrating with upstream Nixpkgs; alternatively, a third-party service which lets parties subscribe to branches or PRs and get notifications about breakage (build failures or test failures), essentially an automatic OfBorg exposing authors to the breakage they cause (automatic posting in PRs)
- Yorik: starting with the assumption that we will not be able to do this with the NixOS foundation?
- Serge: yes, with respect to unfree licenses or specialized hardware, due either to FOSS principles or drama, and perhaps specialized hardware is out of scope of the foundation, and better aligned with nix-community; this could also be a step toward decentralizing OfBorg and moving toward federated CI
- Yorik: main problem is covering test metrics actual users are interested in, covering typical use cases and making sure we don't break them. If I understand correctly, we need to 1. build stuff and 2. test stuff. What if we could build non-redistributable things in upstream Hydra, but redistribute the results without the unfree parts?
- Serge: hydra is there to populate the cache; the infra team is trying to separate the staging jobsets so they use their own cache; hydra has its own cache, and we could add things to that; there were issues filed about doing something similar (building in hydra) but they were closed; testing is the main interest of companies (because they have their own infra), caching only impacts individual users
- Yorik: OfBorg could avoid rebuilding things that are already built in Hydra, even if they're not redistributable
- Serge: OfBorg does consume the official binary cache, it just doesn't publish its own results; I don't see why it would be more efficient to implement this on the OfBorg side rather than the nix-community side
- connor: while companies will typically rebuild the world, ... they build on what upstream has; testing and CI are the value for companies; I've been OK'd to engage with the NixOS Foundation to get NVIDIA's patchelf exception in writing; legal cover is important; OK with "using legal counsel to have these interactions"; getting an exception from NVIDIA would be a step towards this infrastructure; will be trying to organize, in the CUDA chat, the interested parties to reach out to NVIDIA; agrees it could be out of scope for the NixOS Foundation; agrees we need an entity with clear and transparent funding and structure to implement this infra/CI; would like to avoid directing people to a different place, which would feel like a split in the community
- 15:00 Yorik: wanted to mention that our sales people were talking to our clients about figuring out how to deal with NVIDIA; not sure what's needed from the foundation or the community, but it feels like it would be easier to convince people to take such a legal action on behalf of a bigger foundation than a smaller organization; it would be useful for people to just use the default cache; I would like us to avoid going to a different place for CUDA support because it presents as a split in the community
- Gaétan: question about the license exception. Is what we currently do in nix-community OK license-wise? Would the exception change that?
- Connor: No, it is not ;)
- Serge: Tom Berek has tried reaching out (I think) as Flox; I agree with Connor that we need to coordinate this action across the interested parties, and it would be great if it can be coordinated with an actual lawyer; it sounds like Tweag's customers can be directed toward this effort as well; another reply, re: Connor's words on community-owned infrastructure and transparent funding: should the organization be nix-community, or should we set up a separate nonprofit with a separate OpenCollective dedicated to accelerated devices in general? the benefit of nix-community is that it already exists and has Numtide behind it, which is convenient; regarding Gaétan's question about the exemption and what we're doing with the nix-community cache: we're violating the non-modification clause in the license
- Yorik: questions about the current nix-community structure.
- Serge: Jonas Chevellier is a member of the nix-community organization; the organization has a Matrix chat and an infra team; the team is basically zovoq (sp?) and Mic92, who rent Hetzner machines and Cachix
- Gaétan: the infra is split in two: buildbot etc. for nix-community projects; the other set of builders is provided to members of the community to enable contributions
- Connor: we're overtime, but a question about tooling, in terms of ... nixpkgs-review, hydra, ofborg; there was a Discourse discussion about "cloud-scale hydra", dropped for lack of interest; ephemeral builders would be extremely helpful (SomeoneSerge concurs)
- SomeoneSerge replies. Unfamiliar with hydra, interested in organizing the effort to implement support.
- Connor: elaborates on issues of hydra, perl, ~~and C++ written by people trying to avoid C++~~. Proposes "survey of the landscape" as the first step.
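For reference, the conventional way to obtain a CUDA-enabled package set today is to re-evaluate Nixpkgs with a different `config` (a full re-import, which is part of why the cheaper `pkgsCuda = pkgs.extend overlay` approach mentioned above is attractive despite being "unprincipled"). A minimal sketch, using regular Nixpkgs config options; the capability value and final attribute are examples:

```nix
# Sketch: re-import Nixpkgs with CUDA enabled for one GPU architecture.
# This re-evaluates all of Nixpkgs, hence the mass-rebuild concern.
let
  pkgsCuda = import <nixpkgs> {
    config = {
      allowUnfree = true;           # CUDA is unfree
      cudaSupport = true;
      cudaCapabilities = [ "8.6" ]; # e.g. build only for Ampere consumer GPUs
    };
  };
in
pkgsCuda.python3Packages.torch
```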
Conclusion
---
- "Survey of the landscape" (as per connor) re: ephemeral builders. Assigned to SomeoneSerge for now.
- "Choosing the right non-profit for infra": discuss alignment with the nix-community team. Consensus re: scaling up for wider test coverage. Consensus re: scaling up may compete with other nix-community projects. Consensus re: non-profit/community ownership and transparency. Consensus re: value for the companies and funding.
- "cudaSupport, config.cudaSupport interface": consensus it's an issue, no decision how to proceed;
- "patchelf exception": consensus, coordinate the parties for collective action. Assigned to connor.
- libcusparselt: implemented in connor's out-of-tree package set; start by borrowing the existing solution
- re: updates
- re: refactoring cudaPackages' use of evalModules while also merging ConnorBaker/cuda-packages; consensus that this might need coordination
- 15:21 UTC 20' overtime. Didn't discuss the backlog. Haven't decided on the time for the next meeting (left for the chat)
Later notes and additions
---
2025-05-14 11:16 UTC @SomeoneSerge:
> By the way, while on my side I'm advertising both options for provisioning hardware, the spot instances and the owned hardware, I think we might want to incentivize companies to commit to supporting the latter path. While it's obviously more work, organisational and engineering, it is a much better long-term promise for the community. With rented hardware, if two or three companies simultaneously decide to withdraw, we basically have to immediately scale down the CI. If we buy hardware for a non-profit and a few years later some companies decide they're not interested anymore, we maybe lose a retainer covering the maintenance work. With our own hardware we can also be more flexible and maybe dedicate some machines to be used as community builders/devboxes for ad hoc experimentation.