Conda: Myths and Misconceptions

I’ve spent much of the last decade using Python for my research, teaching Python tools to other scientists and developers, and developing Python tools for efficient data manipulation, scientific and statistical computation, and visualization. The Python-for-data landscape has changed

immensely since I first installed NumPy and SciPy from via a flickering CRT display. Among the new developments since those early days, the one with perhaps the broadest impact on my daily workhas been the introduction of conda, the open-source cross-platform package manager first released in 2012. In the four years since its initial release, many words have been spilt introducing conda and espousing its merits, but one thing I have consistently noticed is the number of misconceptions that seem to remain in the (often fervent) discussions surrounding this tool. I hope in this post to do a small part in putting these myths and misconceptions to rest.

I’ve tried to be as succinct as I can, but if you want to skim this article and get the gist of the discussion, you can read each heading along with the the bold summary just below it.

Image result for conda

Myth #1: Conda is a distribution, not a package manager

Reality: Conda is a package manager; Anaconda is a distribution. Although Conda is packaged with Anaconda, the two are distinct entities with distinct goals.

A software distribution is a pre-built and pre-configured collection of packages that can be installed and used on a system. A package manager is a tool that automates the process of installing, updating, and removing packages. Conda, with its “conda install“, “conda update“, and “conda remove” sub-commands, falls squarely under the second definition: it is a package manager.

Perhaps the confusion here comes from the fact that Conda is tightly coupled to two software distributions: Anaconda and Miniconda. Anaconda is a full distribution of the central software in the PyData ecosystem, and includes Python itself along with binaries for several hundred third-party open-source projects. Miniconda is essentially an installer for an empty conda environment, containing only Conda and its dependencies, so that you can install what you need from scratch.

But make no mistake: Conda is as distinct from Anaconda/Miniconda as is Python itself, and (if you wish) can be installed without ever touching Anaconda/Miniconda. For more on each of these, see the conda FAQ.

Myth #2: Conda is a Python package manager

Reality: Conda is a general-purpose package management system, designed to build and manage software of any type from any language. As such, it also works well with Python packages.

Because conda arose from within the Python (more specifically PyData) community, many mistakenly assume that it is fundamentally a Python package manager. This is not the case: conda is designed to manage packages and dependencies within any software stack. In this sense, it’s less like pip, and more like a cross-platform version of apt or yum.

If you use conda, you are already probably taking advantage of many non-Python packages; the following command will list the ones in your environment:

$ conda search --canonical  | grep -v 'py\d\d'

On my system, there are 350 results: these are packages within my Conda/Python environment that are fundamentally unmanageable by Python-only tools like pip & virtualenv.

Myth #3: Conda and pip are direct competitors

Reality: Conda and pip serve different purposes, and only directly compete in a small subset of tasks: namely installing Python packages in isolated environments.

Pip, which stands for Pip Installs Packages, is Python’s officially-sanctioned package manager, and is most commonly used to install packages published on the Python Package Index (PyPI). Both pip and PyPI are governed and supported by the Python Packaging Authority (PyPA).

In short, pip is a general-purpose manager for Python packages; conda is a language-agnostic cross-platform environment manager. For the user, the most salient distinction is probably this: pip installs python packages within any environment; conda installs any package within condaenvironments. If all you are doing is installing Python packages within an isolated environment, conda and pip+virtualenv are mostly interchangeable, modulo some difference in dependency handling and package availability. By isolated environment I mean a conda-env or virtualenv, in which you can install packages without modifying your system Python installation.

Even setting aside Myth #2, if we focus on just installation of Python packages, conda and pip serve different audiences and different purposes. If you want to, say, manage Python packages within an existing system Python installation, conda can’t help you: by design, it can only install packages within conda environments. If you want to, say, work with the many Python packages which rely on external dependencies (NumPy, SciPy, and Matplotlib are common examples), while tracking those dependencies in a meaningful way, pip can’t help you: by design, it manages Python packages and only Python packages.

Conda and pip are not competitors, but rather tools focused on different groups of users and patterns of use.

Myth #4: Creating conda in the first place was irresponsible & divisive

Reality: Conda’s creators pushed Python’s standard packaging to its limits for over a decade, and only created a second tool when it was clear it was the only reasonable way forward.

According to the Zen of Python, when doing anything in Python “There should be one – and preferably only one – obvious way to do it.” So why would the creators of conda muddy the field by introducing a new way to install Python packages? Why didn’t they contribute back to the Python community and improve pip to overcome its deficiencies?

As it turns out, that is exactly what they did. Prior to 2012, the developers of the PyData/SciPy ecosystem went to great lengths to work within the constraints of the package management solutions developed by the Python community. As far back as 2001, the NumPy project forked distutils in an attempt to make it handle the complex requirements of a NumPy distribution. They bundled a large portion of NETLIB into a single monolithic Python package (you might know this as SciPy), in effect creating a distribution-as-python-package to circumvent the fact that Python’s distribution tools cannot manage these extra-Python dependencies in any meaningful way. An entire generation of scientific Python users spent countless hours struggling with the installation hell created by this exercise of forcing a square peg into a round hole – and those were just ones lucky enough to be using Linux. If you were on Windows, forget about it. To read some of the details about these pain-points and how they led to Conda, I’d suggest Travis Oliphant’s 2013 blog post on the topic.

But why didn’t Conda’s creators just talk to the Python packaging folks and figure out these challenges together? As it turns out, they did.

The genesis of Conda came after Guido van Rossum was invited to speak at the inaugural PyData meetup in 2012; in a Q&A on the subject of packaging difficulties, he told us that when it comes to packaging, “it really sounds like your needs are so unusual compared to the larger Python community that you’re just better off building your own” (See video of this discussion). Even while following this nugget of advice from the BDFL, the PyData community continued dialog and collaboration with core Python developers on the topic: one more public example of this was the invitation of CPython core developer Nick Coghlan to keynote at SciPy 2014 (See video here). He gave an excellent talk which specifically discusses pip and conda in the context of the “unsolved problem” of software distribution, and mentions the value of having multiple means of distribution tailored to the needs of specific users.

Far from insinuating that Conda is divisive, Nick and others at the Python Packaging Authority officially recognize conda as one of many important redistributors of Python code, and are working hard to better enable such tools to work seamlessly with the Python Package Index.

Myth #5: conda doesn’t work with virtualenv, so it’s useless for my workflow

Reality: You actually can install (some) conda packages within a virtualenv, but better is to use Conda’s own environment manager: it is fully-compatible with pip and has several advantages over virtualenv.

virtualenv/venv are utilites that allow users to create isolated Python environments that work with pip. Conda has its own built-in environment manager that works seamlessly with both conda and pip, and in fact has several advantages over virtualenv/venv:

  • conda environments integrate management of different Python versions, including installation and updating of Python itself. Virtualenvs must be created upon an existing, externally managed Python executable.
  • conda environments can track non-python dependencies; for example seamlessly managing dependencies and parallel versions of essential tools like LAPACK or OpenSSL
  • Rather than environments built on symlinks – which break the isolation of the virtualenv and can be flimsy at times for non-Python dependencies – conda-envs are true isolated environments within a single executable path.
  • While virtualenvs are not compatible with conda packages, conda environments are entirely compatible with pip packages. First conda install pip, and then you can pip install any available package within that environment. You can even explicitly list pip packages in conda environment files, meaning the full software stack is entirely reproducible from a single environment metadata file.

That said, if you would like to use conda within your virtualenv, it is possible:

$ virtualenv test_conda

$ source test_conda/bin/activate

$ pip install conda

$ conda install numpy

This installs conda’s MKL-enabled NumPy package within your virtualenv. I wouldn’t recommend this: I can’t find documentation for this feature, and the result seems to be fairly brittle – for example, trying to conda update python within the virtualenv fails in a very ungraceful and unrecoverable manner, seemingly related to the symlinks that underly virtualenv’s architecture. This appears not to be some fundamental incompatibility between conda and virtualenv, but rather related to some subtle inconsistencies in the build process, and thus is potentially fixable (see conda Issue 1367 and anaconda Issue 498, for example).

If you want to avoid these difficulties, a better idea would be to pip install conda and then create a new conda environment in which to install conda packages. For someone accustomed to pip/virtualenv/venv command syntax who wants to try conda, the conda docs include a translation table between conda and pip/virtualenv commands.

Myth #6: Now that pip uses wheels, conda is no longer necessary

Reality: wheels address just one of the many challenges that prompted the development of conda, and wheels have weaknesses that Conda’s binaries address.

One difficulty which drove the creation of Conda was the fact that pip could distribute only source code, not pre-compiled binary distributions, an issue that was particularly challenging for users building extension-heavy modules like NumPy and SciPy. After Conda had solved this problem in its own way, pip itself added support for wheels, a binary format designed to address this difficulty within pip. With this issue addressed within the common tool, shouldn’t Conda early-adopters now flock back to pip?

Not necessarily. Distribution of cross-platform binaries was only one of the many problems solved within conda. Compiled binaries spotlight the other essential piece of conda: the ability to meaningfully track non-Python dependencies. Because pip’s dependency tracking is limited to Python packages, the main way of doing this within wheels is to bundle released versions of dependencies with the Python package binary, which makes updating such dependencies painful (recent security updates to OpenSSL come to mind). Additionally, conda includes a true dependency resolver, a component which pip currently lacks.

For scientific users, conda also allows things like linking builds to optimized linear algebra libraries, as Continuum does with its freely-provided MKL-enabled NumPy/SciPy. Conda can even distribute non-Python build requirements, such as gcc, which greatly streamlines the process of building other packages on top of the pre-compiled binaries it distributes. If you try to do this using pip’s wheels, you better hope that your system has compilers and settings compatible with those used to originally build the wheel in question.

Myth #7: conda is not open source; it is tied to a for-profit company who could start charging for the service whenever they want

Reality: conda (the package manager and build system) is 100% open-source, and Anaconda (the distribution) is nearly there as well.

In the open source world, there is (sometimes quite rightly) a fundamental distrust of for-profit entities, and the fact that Anaconda was created by Continuum Analytics and is a free component of a larger enterprise product causes some to worry.

Let’s set aside the fact that Continuum is, in my opinion, one of the few companies really doing open software the right way (a topic for another time). Ignoring that, the fact is that Conda itself – the package manager that provides the utilities to build, distribute, install, update, and manage software in a cross-platform manner – is 100% open-source, available on GitHub and BSD-Licensed. Even for Anaconda (the distribution), the EULA is simply a standard BSD license, and the toolchain used to create Anaconda is also 100% open-source. In short, there is no need to worry about intellectual property issues when using Conda.

If the Anaconda/Miniconda distributions still worry you, rest assured: you don’t need to install Anaconda or Miniconda to get conda, though those are convenient avenues to its use. As we saw above, you can “pip install conda” to install it via PyPI without ever touching Continuum’s website.

Myth #8: But Conda packages themselves are closed-source, right?

Reality: though conda’s default channel is not yet entirely open, there is a community-led effort (Conda-Forge) to make conda packaging & distribution entirely open.

Historically, the package build process for the default conda channel have not been as open as they could be, and the process of getting a build updated has mostly relied on knowing someone at Continuum. Rumor is that this was largely because the original conda package creation process was not as well-defined and streamlined as it is today.

But this is changing. Continuum is making the effort to open their package recipes, and I’ve been told that only a few dozen of the 500+ packages remain to be ported. These few recipes are the only remaining piece of the Anaconda distribution that are not entirely open.

If that’s not enough, there is a new community-led – not Continuum affiliated – project, introduced in early 2016, called conda-forge that contains tools for the creation of community-driven builds for any package. Packages are maintained in the open via github, with binaries automatically built using free CI tools like TravisCI for Mac OSX builds, AppVeyor for Windows builds, and CircleCI for Linux builds. All the metadata for each package lives in a Github repository, and package updates are accomplished through merging a Github pull request (here is an example of what a package update looks like in conda-forge).

Conda-forge is entirely community-founded and community-led, and while conda-forge is probably not yet mature enough to completely replace the default conda channel, Continuum’s founders have publicly stated that this is a direction they would support. You can read more about the promise of conda-forge in Wes McKinney’s recent blog post, conda-forge and PyData’s CentOS moment.

Myth #9: OK, but if Continuum Analytics folds, conda won’t work anymore right?

Reality: nothing about Conda inherently ties it to Continuum Analytics; the company serves the community by providing free hosting of build artifacts. All software distributions need to be hosted by somebody, even PyPI.

It’s true that even conda-forge publishes its package builds to http://anaconda.org/, a website owned and maintained by Continuum Analytics. But there is nothing in Conda that requires this site. In fact, the creation of Custom Channels in conda is well-documented, and there would be nothing to stop someone from building and hosting their own private distribution using Conda as a package manager (conda index is the relevant command). Given the openness of conda recipes and build systems on conda-forge, it would not be all that hard to mirror all of conda-forge on your own server if you have reason to do so.

If you’re still worried about Continuum Analytics – a for-profit company – serving the community by hosting conda packages, you should probably be equally worried about Rackspace – a for-profit company – serving the community by hosting the Python Package Index. In both cases, a for-profit company is integral to the current manifestation of the community’s package management system. But in neither case would the demise of that company threaten the underlying architecture of the build & distribution system, which is entirely free and open source. If either Rackspace or Continuum were to disappear, the community would simply have to find another host and/or financial sponsor for the open distribution it relies on.

Myth #10: Everybody should abandon (conda | pip) and use (pip | conda) instead!

Reality: pip and conda serve different needs, and we should be focused less on how they compete and more on how they work together.

As mentioned in Myth #2, Conda and pip are different projects with different intended audiences: pip installs python packages within any environment; conda installs any package within conda environments. Given the lofty ideals raised in the Zen of Python, one might hope that pip and conda could somehow be combined, so that there would be one and only one obvious way of installing packages.

But this will never happen. The goals of the two projects are just too different. Unless the pip project is broadly re-scoped, it will never be able to meaningfully install and track all the non-Python packages that conda does: the architecture is Python-specific and (rightly) Python-focused. Pip, along with PyPI, aims to be a flexible publication & distribution platform and manager for Python packages, and it does phenomenally well at that.

Likewise, unless the conda package is broadly re-scoped, it will never make sense for it to replace pip/PyPI as a general publishing & distribution platform for Python code. At its very core, conda concerns itself with the type of detailed dependency tracking that is required for robustly running a complex multi-language software stack across multiple platforms. Every installation artifact in conda’s repositories is tied to an exact dependency chain: by design, it wouldn’t allow you to, say, substitute Jython for Python in a given package. You could certainly use conda to build a Jython software stack, but each package would require a new Jython-specific installation artifact – that is what is required to maintain the strict dependency chain that conda users rely on. Pip is much more flexible here, but once cost of that is its inability to precisely define and resolve dependencies as conda does.

Finally, the focus on pip vs. conda entirely misses the broad swath of purpose-designed redistributors of Python code. From platform-specific package managers like apt, yum, macports, and homebrew, to cross-platform tools like bento, buildout, hashdist, and spack, there are a wide range of specific packaging solutions aimed at installing Python (and other) packages for particular users. It would be more fruitful for us to view these, as the Python Packaging Authority does, not as competitors to pip/PyPI, but as downstream tools that can take advantage of the heroic efforts of all those who have developed and maintained pip, PyPI, and associated toolchain.

Where to Go from Here?

So it seems we’re left with two packaging solutions which are distinct, but yet have broad overlap for many Python users (i.e. when installing Python packages in isolated environments). So where should the community go from here? I think the main thing we can do is make sure the projects (1) work together as well as possible, and (2) learn from each other’s successes.

Conda

As mentioned above, conda is already has a fully open toolchain, and is on a steady trend toward fully open packages (but is not entirely there just yet). An obvious direction is to push forward on community development and maintenance of the conda stack via conda-forge, perhaps eventually using it to replace conda’s current default channel.

As we push forward on this, I believe the conda and conda-forge community could benefit from imitating the clear and open governance model of the Python Packaging Authority. For example, PyPA has an open governance model with explicit goals, a clear roadmap for new developments and features, and well-defined channels of communication and discussion, and community oversight of the full pip/PyPI system from the ground up.

With conda and conda-forge, on the other hand, the code (and soon all recipes) is open, but the model for governance and control of the system is far less explicit. Given the importance of conda particularly in the PyData community, it would benefit all of this to clarify this somehow – perhaps under the umbrella of the NumFOCUS organization.

That being said, folks involved with conda-forge have told me that this is currently being addressed by the core team, including generation of governing documents, a code of conduct, and framework for enhancement proposals.

PyPI/pip

While the Python Package Index seems to have its governance in order, there are aspects of conda/conda-forge that I think would benefit it. For example, currently most Python packages can be loaded to conda-forge with just a few steps:

  1. Post a public code release somewhere on the web (on github, bitbucket, PyPI, etc.)
  2. Create a recipe/metadata file that points to this code and lists dependencies
  3. Open a pull request on conda-forge/staged-recipes

And that’s it. Once the pull request is merged, the binary builds on Windows, OSX, and Linux are automatically created and loaded to the conda-forge channel. Additionally, managing and updating the package takes place transparently via github, where package updates can be reviewed by collaborators and tested by CI systems before they go live.

I find this process far preferable to the (by comparison relatively opaque and manual) process of publishing to PyPI, which is mostly done by a single user working in private at a local terminal. Perhaps PyPI could take advantage of conda-forge’s existing build system, and creating an option to automatically build multi-platform wheels and source distributions, and automatically push them to PyPI in a single transparent command. It is definitely a possibility.

Postscript: Which Tool Should I Use?

I hope I’ve convinced you that conda and pip both have a role to play within the Python community. With that behind us, which should you use if you’re starting out? The answer depends on what you want to do:

If you have an existing system Python installation and you want to install packages in or on it, use pip+virtualenv. For example, perhaps you used apt or another system package manager to install Python, along with some packages linked to system tools that are not (yet) easily installable via conda or pip. Pip+virtualenv will allow you to install new Python packages and build environments on top of that existing distribution, and you should be able to rely on your system package manager for any difficult-to-install dependencies.

If you want to flexibly manage a multi-language software stack and don’t mind using an isolated environment, use conda. Conda’s multi-language dependency management and cross-platform binary installations can do things in this situation that pip cannot do. A huge benefit is that for most packages, the result will be immediately compatible with multiple operating systems.

If you want to install Python packages within an Isolated environment, pip+virtualenv and conda+conda-env are mostly interchangeable. This is the overlap region where both tools shine in their own way. That being said, I tend to prefer conda in this situation: Conda’s uniform, cross-platform, full-stack management of multiple parallel Python environments with robust dependency management has proven to be an incredible time-saver in my research, my teaching, and my software development work. Additionally, I find that my needs and the needs of my colleagues more often stray into areas of conda’s strengths (management of non-Python tools and dependencies) than into areas of pip’s strengths (environment-agnostic Python package management).

As an example, years ago I spent nearly a quarter with a colleague trying to install the complicated (non-Python) software stack that powers the megaman package, which we were developing together. The result of all our efforts was a single non-reproducible working stack on a single machine. Then conda-forge was introduced. We went through the process again, this time creating a conda recipe, from which a conda-forge feedstock was built. We now have a cross-platform solution that will install a working version of the package and its dependencies with a single command, in seconds, on nearly any computer. If there is a way to build and distribute software with that kind of dependency graph seamlessly with pip+PyPI, I haven’t seen it.


If you’ve read this far, I hope you’ve found this discussion useful. My own desire is that we as a community can continue to rally around both these tools, improving them for the benefit of current and future users. Python packaging has improved immensely in the last decade, and I’m excited to see where it will go from here.

Thanks to Filipe Fernandez, Aaron Meurer, Bryan van de Ven, and Phil Elson for helpful feedback on early drafts of this post. As always, any mistakes are my own.


Source: https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Powered by WordPress.com.

Up ↑

%d bloggers like this: