Originally posted on machinelearningmastery.
Python is a neat programming language because its syntax is simple, clear, and concise. But Python would not be so successful without its rich third-party libraries. Python is so popular for data science and machine learning that it has become a de facto lingua franca, largely because so many libraries exist for those tasks. Without those libraries, Python would be far less powerful.
After finishing this tutorial, you will know:
- Where the Python libraries are installed on your system
- What PyPI is and how a library repository can help your project
- How to use the pip command to install a library from the repository
Let’s get started.
Overview
This tutorial is in five parts; they are:
- The Python ecosystem
- Python libraries location
- The pip command
- Search for a package
- Host your own repository
The Python ecosystem
In the old days before the Internet, the language and the libraries were separate. When you learned C from a textbook, you would not see anything to help you read a CSV file or open a PNG image. It was the same in the early days of Java. If you needed anything not included in the official libraries, you had to search for it in various places. How to download or install a library was specific to its vendor.
It would be far more convenient to have a central repository that hosts many libraries, lets us install them through a unified interface, and allows us to check for new versions from time to time. Even better, we may also search the repository with keywords to discover a library that can help our project. CPAN is an example of a library repository for Perl. Similarly, we have CRAN for R, RubyGems for Ruby, npm for Node.js, and Maven for Java. For Python, we have PyPI (Python Package Index), https://pypi.org/.
PyPI is platform agnostic. If you installed Python on Windows by downloading the installer from python.org, you have the pip command to access PyPI. If you used Homebrew on Mac to install Python, you also have the same pip command. It is the same even if you use the built-in Python from Ubuntu Linux.
As a repository, PyPI has almost anything you can think of, from large libraries like TensorFlow and PyTorch to small things like minimal. Because of the vast number of libraries available on PyPI, you can easily find tools that implement some important component of your project. Therefore, we have a strong and growing ecosystem of libraries in Python that makes it more powerful every day.
Python libraries location
When we need a library in our Python scripts, we use
    import module_name
But how does Python know where to read the content of the module and load it for our scripts? Similar to how the bash shell in Linux or the command prompt in Windows looks for a command to execute, Python depends on a list of paths to locate the module to load. At any time, we can check the paths by printing the list sys.path (after importing the sys module). For example, in a Mac installation of Python via Homebrew,
    import sys
    print(sys.path)
prints the following:
    ['',
     '/usr/local/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python39.zip',
     '/usr/local/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9',
     '/usr/local/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/lib-dynload',
     '/usr/local/lib/python3.9/site-packages']
This means if you run import my_module, Python will first look for my_module in the same directory as your current location (the first element, an empty string). If it is not found, Python will check for the module inside the zip file in the second element above, then under the directory in the third element, and so on. The final path, /usr/local/lib/python3.9/site-packages, is usually where your third-party libraries are installed. The second, third, and fourth elements above are where the built-in standard libraries are located.
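To see this lookup in action, the following minimal sketch asks Python where it would load a module from, without actually importing it. It relies on importlib.util.find_spec from the standard library; the helper name locate_module is made up for this illustration.

```python
import importlib.util


def locate_module(module_name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    # spec.origin is a file path for ordinary modules
    # (it can also be a marker such as "built-in" for compiled-in modules)
    return spec.origin


# "json" ships with Python, so it is found in one of the standard library paths
print(locate_module("json"))
# A name that is not under any entry of sys.path gives None
print(locate_module("module_that_does_not_exist"))
```

The path printed for json should sit under one of the standard library entries of your own sys.path, while a third-party package would be reported under site-packages.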
If you have some extra libraries installed elsewhere, you can set the environment variable PYTHONPATH to point to them. In Linux and Mac, for example, we can run the command in the shell as follows:
    $ PYTHONPATH="/tmp:/var/tmp" python print_path.py
where print_path.py is the two-line code above. Running this command will print the following:
    ['', '/tmp', '/var/tmp',
     '/usr/local/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python39.zip',
     '/usr/local/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9',
     '/usr/local/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/lib-dynload',
     '/usr/local/lib/python3.9/site-packages']
from which we see that Python will search /tmp, then /var/tmp, before checking the built-in libraries and the installed third-party libraries. When we set the PYTHONPATH environment variable, we use a colon “:” to separate the multiple paths to search for our import. In case you are not familiar with the shell syntax, the above command line that defines the environment variable and runs the Python script can be broken into two commands:
    $ export PYTHONPATH="/tmp:/var/tmp"
    $ python print_path.py
If you’re using Windows, you need to do this instead:
    C:\> set PYTHONPATH=C:\temp;D:\temp
    C:\> python print_path.py
That is, we need to use a semicolon “;” to separate the paths.
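If you set PYTHONPATH from a Python program rather than from the shell, the separator does not need to be hard-coded: os.pathsep is “:” on Linux and Mac and “;” on Windows. The sketch below is one possible way to do this. The two directories are arbitrary choices for illustration, and the child process simply prints its sys.path as JSON so that the parent can read it back.

```python
import json
import os
import subprocess
import sys
import tempfile

# Two arbitrary existing directories, used only for illustration
extra_paths = [tempfile.gettempdir(), os.path.expanduser("~")]

# Join them with the platform's separator and hand them to a child Python process
env = dict(os.environ, PYTHONPATH=os.pathsep.join(extra_paths))
result = subprocess.run(
    [sys.executable, "-c", "import json, sys; print(json.dumps(sys.path))"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)

# The child's search list should now include the extra directories
child_path = json.loads(result.stdout)
print(child_path)
```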
Note: It is not recommended, but you can modify sys.path in your script before the import statement. Python will search the new locations for the import afterwards, but doing so ties your script to a particular path. In other words, your script may not run on another computer.
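As a small illustration of the note above, this sketch creates a throwaway module in a temporary directory, adds that directory to sys.path, and imports the module. The module name demo_extra_module and its GREETING variable are hypothetical; in a real script the directory would be a fixed location, which is exactly what makes the script less portable.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as extra_dir:
    # Create a tiny module in a directory that is not on sys.path yet
    Path(extra_dir, "demo_extra_module.py").write_text("GREETING = 'hello'\n")

    # Put the directory at the front of the search list before importing
    sys.path.insert(0, extra_dir)
    importlib.invalidate_caches()
    import demo_extra_module

    greeting = demo_extra_module.GREETING

    # Undo the change so that later imports are not affected
    sys.path.remove(extra_dir)

print(greeting)
```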
The pip command
The last path in the sys.path printed above is where your third-party libraries are normally installed. The pip command is how you get a library from the Internet and install it to that location. The simplest syntax is:
    pip install scikit-learn pandas
This will install two packages, scikit-learn and pandas. Later, you may want to upgrade the packages when a new version is released. The syntax is:
    pip install -U scikit-learn
where -U means to upgrade. To know which packages are outdated, we can use the command:
    pip list --outdated
It will print the list of all packages that have a newer version on PyPI than the one installed on your system, such as the following:
    Package  Version Latest Type
    -------- ------- ------ -----
    absl-py  0.14.0  1.0.0  wheel
    anyio    3.4.0   3.5.0  wheel
    ...
    xgboost  1.5.1   1.5.2  wheel
    yfinance 0.1.69  0.1.70 wheel
Without --outdated, the pip list command will show you all the installed packages and their versions. You can optionally show the location where each package is installed with the -v option, such as the following:
    $ pip list -v
    Package     Version Location                               Installer
    ----------- ------- -------------------------------------- ---------
    absl-py     0.14.0  /usr/local/lib/python3.9/site-packages pip
    aiohttp     3.8.1   /usr/local/lib/python3.9/site-packages pip
    aiosignal   1.2.0   /usr/local/lib/python3.9/site-packages pip
    anyio       3.4.0   /usr/local/lib/python3.9/site-packages pip
    ...
    word2number 1.1     /usr/local/lib/python3.9/site-packages pip
    wrapt       1.12.1  /usr/local/lib/python3.9/site-packages pip
    xgboost     1.5.1   /usr/local/lib/python3.9/site-packages pip
    yfinance    0.1.69  /usr/local/lib/python3.9/site-packages pip
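A similar listing can be produced from inside Python. The following is a minimal sketch using importlib.metadata from the standard library (available from Python 3.8); it only reads what is installed for the interpreter running it, and the variable name installed is chosen for illustration.

```python
from importlib import metadata

# Map each installed distribution name to its version
installed = {}
for dist in metadata.distributions():
    name = dist.metadata["Name"]
    if name:  # skip entries whose metadata lacks a name
        installed[name] = dist.version

# Print the packages alphabetically, ignoring case
for name in sorted(installed, key=str.lower):
    print(f"{name:30} {installed[name]}")
```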
In case you need to check the summary of a package, you can use the pip show command, e.g.,
    $ pip show pandas
    Name: pandas
    Version: 1.3.4
    Summary: Powerful data structures for data analysis, time series, and statistics
    Home-page: https://pandas.pydata.org
    Author: The Pandas Development Team
    Author-email: pandas-dev@python.org
    License: BSD-3-Clause
    Location: /usr/local/lib/python3.9/site-packages
    Requires: numpy, python-dateutil, pytz
    Required-by: bert-score, copulae, datasets, pandas-datareader, seaborn, statsmodels, ta, textattack, yfinance
This gives you some information such as the home page, where you installed it, as well as what other packages it depends on and the packages depending on it.
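Part of this information can also be read programmatically. The sketch below, which assumes Python 3.8 or later, collects a few of the same fields with importlib.metadata. The helper name summarize is hypothetical, and it does not reproduce everything pip show prints (for example, it does not compute the "Required-by" list).

```python
from importlib import metadata


def summarize(package_name):
    """Collect a few fields similar to `pip show`, or return None if not installed."""
    try:
        info = metadata.metadata(package_name)
    except metadata.PackageNotFoundError:
        return None
    return {
        "name": info.get("Name"),
        "version": info.get("Version"),
        "summary": info.get("Summary"),
        # Dependency specifiers declared by the package; None if it declares none
        "requires": metadata.requires(package_name),
    }


# pip itself is usually installed, so this normally prints a dictionary
print(summarize("pip"))
# A package that is not installed gives None
print(summarize("a-package-that-is-not-installed"))
```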
When you need to remove a package (e.g., to free up the disk space), you can simply run
    pip uninstall tensorflow
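When pip is driven from a script (for example, to prepare an environment), a common pattern is to call it through the running interpreter with python -m pip, so that the packages land in the site-packages of that same interpreter. The following is a minimal sketch; the helper name pip_command is hypothetical, and the actual run is left commented out because it would change your installation.

```python
import subprocess
import sys


def pip_command(*args):
    """Build a pip command line that runs under the current Python interpreter."""
    return [sys.executable, "-m", "pip", *args]


install_cmd = pip_command("install", "scikit-learn", "pandas")
upgrade_cmd = pip_command("install", "-U", "scikit-learn")
# -y answers the confirmation prompt so the command can run unattended
uninstall_cmd = pip_command("uninstall", "-y", "tensorflow")

print(install_cmd)
# Uncomment to actually run one of the commands:
# subprocess.run(install_cmd, check=True)
```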
One final note on using the pip command: There are two types of packages that pip can install, those distributed as source code and those distributed as binaries. They differ only when part of the module is not written in Python but in some other language (e.g., C or Cython) and needs to be compiled before use. Source packages are compiled on your machine, while binary distributions are already compiled but specific to a platform (e.g., 64-bit Windows). Usually the latter are distributed as “wheel” packages, and it helps to have the wheel package installed first to enjoy the full benefit:
    pip install wheel
A large package such as TensorFlow can take many hours to compile from scratch. Therefore, it is advisable to have wheel installed and to use the wheel packages whenever they are available.
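The platform a wheel was built for is visible in its file name, which ends with a Python tag, an ABI tag, and a platform tag. The sketch below splits a file name into those parts. The function name and the example file names are chosen for illustration only, and the parsing is deliberately simple: it takes the last three fields as tags and the first two as name and version.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected wheel file name: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


# A binary wheel tied to CPython 3.9 on 64-bit Windows
print(parse_wheel_filename("pandas-1.3.4-cp39-cp39-win_amd64.whl"))
# A pure-Python wheel (hypothetical package) that works on any platform
print(parse_wheel_filename("example_pkg-1.0-py3-none-any.whl"))
```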
Search for a package
Newer versions of the pip command have the search function disabled because it imposed too much workload on the PyPI system.
Instead, we can look for a package on PyPI using the search box on its webpage.
When you type in a keyword, such as “gradient boosting”, it will show you many packages that contain the keyword somewhere, and you can click on each one for more details (usually including code examples) to determine which one fits your needs.
If you prefer the command line, you can install the pip-search package:
    pip install pip-search
and then you can run the pip_search command to search with a keyword:
    pip_search gradient boosting
It will not give you everything on PyPI because there would be thousands of them. But it will give you the most relevant results. Below is the result from a Mac terminal:
Host your own repository
PyPI is a repository on the Internet, but the pip command does not use it exclusively. If you have some reason to want your own PyPI server (for example, hosting it internally in your corporate network so that pip does not go beyond your firewall), you can try out the pypiserver package:
    pip install pypiserver
Following the package’s documentation, you can set up your server using the pypi-server command. Then, you can upload packages and start serving. How to configure and set up your own server would take too long to describe in detail here. But what it does is provide an index of available packages in a format that the pip command can understand, and provide the package for download when pip requests a particular one.
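To give a feel for what such an index looks like, the sketch below parses a made-up page in the style of the “simple” repository layout, where each project has an HTML page with one link per downloadable file. The page content, the package name mypkg, and the class name LinkCollector are all assumptions for illustration; a real server’s pages carry more detail, and pip’s own handling is more involved than this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on an index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# A made-up project page listing one source package and one wheel
SAMPLE_PAGE = """<!DOCTYPE html>
<html><body>
<a href="/packages/mypkg-0.1.0.tar.gz">mypkg-0.1.0.tar.gz</a>
<a href="/packages/mypkg-0.1.0-py3-none-any.whl">mypkg-0.1.0-py3-none-any.whl</a>
</body></html>"""

collector = LinkCollector()
collector.feed(SAMPLE_PAGE)
print(collector.links)
```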
If you have your own server, you can install a package from it with pip as follows:
    pip install pandas --index-url https://192.168.0.234:8080
where the address after --index-url is the host and port number of your own server.
PyPI is not the only repository. If you installed Python with Anaconda, you have an alternative system, conda, to install packages. The syntax is similar (replacing pip with conda almost always works as expected). However, keep in mind that they are two different systems that work independently.
Further reading
This section provides more resources on the topic if you are looking to go deeper.
- pip documentation, https://pip.pypa.io/en/stable/
- Python package index, https://pypi.org/
- pypiserver package, https://pypi.org/project/pypiserver/
Summary
In this tutorial, you’ve discovered the pip command and how it brings you the abundant packages from the Python ecosystem to help your project. Specifically, you learned:
- How to look for a package on PyPI
- How Python manages its libraries on your system
- How to install, upgrade, and remove a package from your system
- How we can host our own version of PyPI in our network