.. _10min_tut_03_subset:
{{ header }}
.. ipython:: python
import pandas as pd
.. raw:: html
` with column names, whereas
the outer square brackets are used to select the data from a pandas
``DataFrame`` as seen in the previous example.
The returned data type is a pandas DataFrame:
.. ipython:: python
type(titanic[["Age", "Sex"]])
.. ipython:: python
titanic[["Age", "Sex"]].shape
The selection returned a ``DataFrame`` with 891 rows and 2 columns. Remember, a
``DataFrame`` is 2-dimensional with both a row and column dimension.
.. raw:: html
To user guide
For basic information on indexing, see the user guide section on :ref:`indexing and selecting data `.
.. raw:: html
How do I filter specific rows from a ``DataFrame``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. image:: ../../_static/schemas/03_subset_rows.svg
:align: center
.. raw:: html
-
I’m interested in the passengers older than 35 years.
.. ipython:: python
above_35 = titanic[titanic["Age"] > 35]
above_35.head()
To select rows based on a conditional expression, use a condition inside
the selection brackets ``[]``.
.. raw:: html
The condition inside the selection
brackets ``titanic["Age"] > 35`` checks for which rows the ``Age``
column has a value larger than 35:
.. ipython:: python
titanic["Age"] > 35
The output of the conditional expression (``>``, but also ``==``,
``!=``, ``<``, ``<=``,… would work) is actually a pandas ``Series`` of
boolean values (either ``True`` or ``False``) with the same number of
rows as the original ``DataFrame``. Such a ``Series`` of boolean values
can be used to filter the ``DataFrame`` by putting it in between the
selection brackets ``[]``. Only rows for which the value is ``True``
will be selected.
We know from before that the original Titanic ``DataFrame`` consists of
891 rows. Let’s have a look at the number of rows which satisfy the
condition by checking the ``shape`` attribute of the resulting
``DataFrame`` ``above_35``:
.. ipython:: python
above_35.shape
.. raw:: html
-
I’m interested in the Titanic passengers from cabin class 2 and 3.
.. ipython:: python
class_23 = titanic[titanic["Pclass"].isin([2, 3])]
class_23.head()
Similar to the conditional expression, the :func:`~Series.isin` conditional function
returns a ``True`` for each row the values are in the provided list. To
filter the rows based on such a function, use the conditional function
inside the selection brackets ``[]``. In this case, the condition inside
the selection brackets ``titanic["Pclass"].isin([2, 3])`` checks for
which rows the ``Pclass`` column is either 2 or 3.
.. raw:: html
The above is equivalent to filtering by rows for which the class is
either 2 or 3 and combining the two statements with an ``|`` (or)
operator:
.. ipython:: python
class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
class_23.head()
.. note::
When combining multiple conditional statements, each condition
must be surrounded by parentheses ``()``. Moreover, you can not use
``or``/``and`` but need to use the ``or`` operator ``|`` and the ``and``
operator ``&``.
.. raw:: html
To user guide
See the dedicated section in the user guide about :ref:`boolean indexing ` or about the :ref:`isin function `.
.. raw:: html
.. raw:: html
-
I want to work with passenger data for which the age is known.
.. ipython:: python
age_no_na = titanic[titanic["Age"].notna()]
age_no_na.head()
The :meth:`~Series.notna` conditional function returns a ``True`` for each row the
values are not a ``Null`` value. As such, this can be combined with the
selection brackets ``[]`` to filter the data table.
.. raw:: html
You might wonder what actually changed, as the first 5 lines are still
the same values. One way to verify is to check if the shape has changed:
.. ipython:: python
age_no_na.shape
.. raw:: html
To user guide
For more dedicated functions on missing values, see the user guide section about :ref:`handling missing data `.
.. raw:: html
.. _10min_tut_03_subset.rows_and_columns:
How do I select specific rows and columns from a ``DataFrame``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. image:: ../../_static/schemas/03_subset_columns_rows.svg
:align: center
.. raw:: html
-
I’m interested in the names of the passengers older than 35 years.
.. ipython:: python
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
adult_names.head()
In this case, a subset of both rows and columns is made in one go and
just using selection brackets ``[]`` is not sufficient anymore. The
``loc``/``iloc`` operators are required in front of the selection
brackets ``[]``. When using ``loc``/``iloc``, the part before the comma
is the rows you want, and the part after the comma is the columns you
want to select.
.. raw:: html
When using column names, row labels or a condition expression, use
the ``loc`` operator in front of the selection brackets ``[]``. For both
the part before and after the comma, you can use a single label, a list
of labels, a slice of labels, a conditional expression or a colon. Using
a colon specifies you want to select all rows or columns.
.. raw:: html
-
I’m interested in rows 10 till 25 and columns 3 to 5.
.. ipython:: python
titanic.iloc[9:25, 2:5]
Again, a subset of both rows and columns is made in one go and just
using selection brackets ``[]`` is not sufficient anymore. When
specifically interested in certain rows and/or columns based on their
position in the table, use the ``iloc`` operator in front of the
selection brackets ``[]``.
.. raw:: html
When selecting specific rows and/or columns with ``loc`` or ``iloc``,
new values can be assigned to the selected data. For example, to assign
the name ``anonymous`` to the first 3 elements of the fourth column:
.. ipython:: python
titanic.iloc[0:3, 3] = "anonymous"
titanic.head()
.. raw:: html
To user guide
See the user guide section on :ref:`different choices for indexing ` to get more insight into the usage of ``loc`` and ``iloc``.
.. raw:: html
.. raw:: html
REMEMBER
- When selecting subsets of data, square brackets ``[]`` are used.
- Inside these square brackets, you can use a single column/row label, a list
of column/row labels, a slice of labels, a conditional expression or
a colon.
- Use ``loc`` for label-based selection (using row/column names).
- Use ``iloc`` for position-based selection (using table positions).
- You can assign new values to a selection based on ``loc``/``iloc``.
.. raw:: html
.. raw:: html
To user guide
A full overview of indexing is provided in the user guide pages on :ref:`indexing and selecting data `.
.. raw:: html