Exploring the Python Ecosystem

Originally posted on machinelearningmastery.

Python is a neat programming language because its syntax is simple, clear, and concise. But Python would not be so successful without its rich third-party libraries. Python is so widely used for data science and machine learning that it has become a de facto lingua franca, largely because we have so many libraries for those tasks. Without those libraries, Python would be far less powerful.

After finishing this tutorial, you will know:

    • Where the Python libraries are installed in your system
    • What PyPI is and how a library repository can help your project
    • How to use the pip command to use a library from the repository

Let’s get started.

Overview

This tutorial is in five parts; they are:

  • The Python ecosystem
  • Python libraries location
  • The pip command
  • Search for a package
  • Host your own repository

The Python ecosystem

In the old days before the Internet, the language and the libraries were separate. When you learned C from a textbook, you would not see anything to help you read a CSV file or open a PNG image. It was the same in the early days of Java. If you needed anything not included in the official libraries, you had to search for it in various places. How to download or install a library would be specific to the vendor of that library.

It would be far more convenient if we had a central repository that hosts many libraries, lets us install them through a unified interface, and allows us to check for new versions from time to time. Even better, we could search the repository with keywords to discover a library that can help our project. CPAN is an example of such a library repository for Perl. Similarly, we have CRAN for R, RubyGems for Ruby, npm for Node.js, and Maven for Java. For Python, we have PyPI (Python Package Index), https://pypi.org/.

PyPI is platform agnostic. If you installed Python on Windows by downloading the installer from python.org, you have the pip command to access PyPI. If you used Homebrew on Mac to install Python, you also have the same pip command. It is the same even if you use the built-in Python from Ubuntu Linux.

As a repository, PyPI hosts almost anything, from large libraries like TensorFlow and PyTorch to small things like minimal. Because of the vast number of libraries available on PyPI, you can easily find tools that implement some important component of your project. Therefore, we have a strong and growing ecosystem of libraries in Python that makes it more powerful every day.

Python libraries location

When we need a library in our Python scripts, we use an import statement such as import my_module, but how does Python know where to find the content of the module and load it for our scripts? Similar to how the bash shell in Linux or the command prompt in Windows looks for the command to execute, Python depends on a list of paths to locate the module to load. At any time, we can check these paths by printing the list sys.path (after importing the sys module).

For example, in a Mac installation of Python via Homebrew, running the two lines import sys and print(sys.path) prints a list that starts with an empty string, continues with a zip file and the directories of the standard library, and ends with /usr/local/lib/python3.9/site-packages.

This means if you run import my_module, Python will look for my_module in the same directory as your current location first (the first element, an empty string). If it is not found there, Python will check for the module inside the zip file in the second element, then in the directory given by the third element, and so on. The final path, /usr/local/lib/python3.9/site-packages, is usually where your third-party libraries are installed. The second, third, and fourth elements are where the built-in standard libraries are located.
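To see the search order on your own machine, the following minimal sketch prints every entry of sys.path together with its position. The exact paths will differ from one installation to another, and the helper name describe_search_path is made up for this illustration.

```python
import sys


def describe_search_path(paths=None):
    """Return (position, entry) pairs for the module search path."""
    # Default to the live search path of the running interpreter
    paths = sys.path if paths is None else paths
    return list(enumerate(paths))


for position, entry in describe_search_path():
    # An empty string stands for the current working directory
    label = entry if entry else "'' (current directory)"
    print(position, label)
```

Entries that appear earlier in the list are searched first, which is why a file in your current directory can shadow an installed library of the same name.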

If you have some extra libraries installed elsewhere, you can set the environment variable PYTHONPATH to point to them. In Linux and Mac, for example, we can run a command in the shell such as PYTHONPATH="/tmp:/var/tmp" python print_path.py, where print_path.py contains the two lines mentioned above (import sys followed by print(sys.path)).

Running this command shows that Python will search /tmp, then /var/tmp, before checking the built-in libraries and installed third-party libraries. When we set the PYTHONPATH environment variable, we use a colon “:” to separate multiple paths to search for our import. In case you are not familiar with the shell syntax, the above command line, which defines the environment variable and runs the Python script, can be broken into two commands: export PYTHONPATH="/tmp:/var/tmp" followed by python print_path.py.

If you’re using Windows, you need to set the variable with the set command in the command prompt instead, and use a semicolon “;” to separate the paths.
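The effect of PYTHONPATH can also be checked from Python itself. The sketch below is only an illustration under a few assumptions: it starts a child interpreter with PYTHONPATH pointing at two temporary directories (stand-ins for /tmp and /var/tmp) and returns the child's sys.path. The helper name child_sys_path is hypothetical. Using os.pathsep picks the colon or the semicolon for the current platform.

```python
import json
import os
import subprocess
import sys
import tempfile


def child_sys_path(extra_dirs):
    """Run a child interpreter with PYTHONPATH set and return its sys.path."""
    env = dict(os.environ)
    # os.pathsep is ":" on Linux and Mac, ";" on Windows
    env["PYTHONPATH"] = os.pathsep.join(extra_dirs)
    code = "import json, sys; print(json.dumps(sys.path))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


# Two temporary directories play the role of the extra library locations
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    paths = child_sys_path([dir_a, dir_b])
    for position, entry in enumerate(paths):
        print(position, repr(entry))
```

The two extra directories should show up near the top of the printed list, in the order they were given and ahead of the standard library locations.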

Note: It is not recommended, but you can modify sys.path in your script before the import statement. Python will then search the new locations for the import, but doing so ties your script to a particular path. In other words, your script may not run on another computer.
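As a rough sketch of what the note describes, the snippet below adds a directory to sys.path at run time and imports a module from it. The directory is a temporary one and the module my_module_demo is created on the spot, so both are assumptions for illustration only; in a real script the path would be hard-coded, which is exactly what makes the script less portable.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as extra_dir:
    # Create a tiny throwaway module (hypothetical name) in the extra directory
    Path(extra_dir, "my_module_demo.py").write_text("GREETING = 'hello'\n")

    # Put the directory at the front of the search path before importing
    sys.path.insert(0, extra_dir)
    try:
        importlib.invalidate_caches()
        my_module = importlib.import_module("my_module_demo")
        print(my_module.GREETING)
    finally:
        # Undo the change so later imports are not affected
        sys.path.remove(extra_dir)
```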

The pip command

The last path in the sys.path printed above is where your third-party libraries are normally installed. The pip command is how you get a library from the Internet and install it to that location. The simplest syntax is pip install followed by the package names, for example pip install scikit-learn pandas.

This will install two packages, scikit-learn and pandas. Later, you may want to upgrade the packages when a new version is released. The syntax is pip install -U scikit-learn pandas, where -U means to upgrade. To know which packages are outdated, we can use the command pip list --outdated.

It will print the list of all packages that have a newer version on PyPI than the one on your system, along with the installed and latest version of each.

Without --outdated, the pip list command shows all the installed packages and their versions. You can optionally show the location where each package is installed with the -v option, that is, pip list -v.
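If you want similar information from inside a script, the standard library module importlib.metadata (Python 3.8 and later) can enumerate installed distributions. The following is a minimal sketch rather than a replacement for pip list; the function name installed_packages is made up for this example.

```python
from importlib import metadata


def installed_packages():
    """Return sorted (name, version, location) tuples for installed distributions."""
    rows = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name")
        if name is None:
            continue  # skip distributions with broken metadata
        # locate_file("") resolves to the directory the distribution lives in
        location = str(dist.locate_file(""))
        rows.add((name, meta.get("Version"), location))
    return sorted(rows, key=lambda row: row[0].lower())


for name, version, location in installed_packages():
    print(f"{name}=={version}  ({location})")
```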

In case you need to check the summary of a package, you can use the pip show command followed by the package name, e.g., pip show pandas.

This gives you some information such as the home page, where the package is installed, what other packages it depends on, and which packages depend on it.
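Part of this summary can also be read programmatically. The sketch below uses importlib.metadata to look up a few fields of an installed package; it does not compute the reverse dependencies that pip show reports, and the function name package_summary is an assumption for this example.

```python
from importlib import metadata


def package_summary(name):
    """Return a small dict of metadata for an installed package, or None if absent."""
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    meta = dist.metadata
    return {
        "name": meta.get("Name"),
        "version": meta.get("Version"),
        "summary": meta.get("Summary"),
        "home_page": meta.get("Home-page"),
        # dist.requires is None when the package declares no dependencies
        "requires": dist.requires or [],
    }


# pip is used here only because it is commonly present; any installed name works
print(package_summary("pip"))
```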

When you need to remove a package (e.g., to free up disk space), you can simply run pip uninstall followed by the package name.

One final note on using the pip command: there are two types of packages from pip, those distributed as source code and those distributed as binary. They differ only when part of the module is not written in Python but in some other language (e.g., C or Cython) and needs to be compiled before use. A source package will be compiled on your machine, whereas a binary distribution is already compiled but specific to a platform (e.g., 64-bit Windows). Usually the latter is distributed as “wheel” packages, and you need to have wheel installed first to enjoy the full benefit, which you can do with pip install wheel.

A large package such as TensorFlow will take many hours to compile from scratch. Therefore, it is advisable to have wheel installed and use the wheel packages whenever they are available.
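You can often tell the two kinds of packages apart from the file name alone. Wheel files follow the naming convention name-version-python tag-abi tag-platform tag.whl (with an optional build tag after the version), while source distributions are usually .tar.gz or .zip archives. The sketch below classifies a file name on that basis; the file names and the function describe_distribution are hypothetical and only serve as illustration.

```python
def describe_distribution(filename):
    """Classify a distribution file name as a wheel or a source distribution."""
    if filename.endswith(".whl"):
        # Wheel names: name-version[-build]-python_tag-abi_tag-platform_tag.whl
        parts = filename[: -len(".whl")].split("-")
        if len(parts) not in (5, 6):
            raise ValueError(f"not a valid wheel file name: {filename}")
        python_tag, abi_tag, platform_tag = parts[-3:]
        # A platform tag of "any" means the wheel is not tied to one platform
        kind = "pure-Python wheel" if platform_tag == "any" else "platform-specific wheel"
        return {"kind": kind, "python": python_tag, "abi": abi_tag, "platform": platform_tag}
    if filename.endswith((".tar.gz", ".zip")):
        return {"kind": "source distribution"}
    raise ValueError(f"unrecognized distribution file: {filename}")


# Hypothetical file names, for illustration only
for filename in (
    "example_pkg-1.0.0-cp39-cp39-win_amd64.whl",
    "example_pkg-1.0.0-py3-none-any.whl",
    "example_pkg-1.0.0.tar.gz",
):
    print(filename, "->", describe_distribution(filename)["kind"])
```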

Search for a package

Newer versions of the pip command disabled the search function because it imposed too much workload on the PyPI system.

The way we can look for a package on PyPI is to use the search box on its webpage. When you type in a keyword, such as “gradient boosting”, it will show you many packages that contain the keyword somewhere, and you can click on each one for more details (usually including code examples) to determine which one fits your needs.

If you prefer the command line, you can install the pip-search package with pip install pip-search, and then run the pip_search command with a keyword, for example pip_search gradient boosting.

It will not give you everything on PyPI because there would be thousands of matching packages. But it will give you the most relevant results.

Host your own repository

PyPI is a repository on the Internet, but the pip command does not use it exclusively. If you have some reason to want your own PyPI server (for example, hosting one internally in your corporate network so that pip does not go beyond your firewall), you can try out the pypiserver package, which can be installed with pip install pypiserver.

Following the package’s documentation, you can set up your server using the pypi-server command. Then, you can upload packages and start serving. How to configure and set up your own server would be too long to describe in detail here. But what it does is provide an index of available packages in a format that the pip command can understand, and provide the package for download when pip requests a particular one.

If you have your own server, you can install a package from it by adding the --index-url option to the pip install command, where the address after --index-url is the host and port number of your own server.
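If you drive pip from a script, the same option can be passed on the command line that the script builds. The sketch below assembles such a command and prints it without running it. The server address http://localhost:8080/simple/ and the choice of pandas are assumptions for illustration, so substitute the address of your own server; the helper name pip_install_command is also made up.

```python
import subprocess
import sys


def pip_install_command(packages, index_url=None, upgrade=False):
    """Build the argument list for running pip install with the current interpreter."""
    command = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        command.append("--upgrade")  # long form of -U
    if index_url is not None:
        command += ["--index-url", index_url]
    return command + list(packages)


# Hypothetical server address and package, for illustration only
command = pip_install_command(["pandas"], index_url="http://localhost:8080/simple/")
print(" ".join(command))

# To actually run the installation, uncomment the next line:
# subprocess.run(command, check=True)
```

Calling pip as sys.executable -m pip makes sure the package is installed for the same interpreter that runs the script.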

PyPI is not the only repository. If you installed Python with Anaconda, you have an alternative system, conda, to install packages. The syntax is similar (replacing pip with conda almost always works as expected). However, keep in mind that they are two different systems that work independently.

Summary

In this tutorial, you’ve discovered the pip command and how it brings you the abundant packages from the Python ecosystem to help your project. Specifically, you learned:

  • How to look for a package on PyPI
  • How Python manages its libraries in your system
  • How to install, upgrade, and remove a package from your system
  • How we can host our own version of PyPI in our network
