.. _10min_tut_03_subset:

{{ header }}

.. ipython:: python

    import pandas as pd

Data used for this tutorial:

How do I select a subset of a ``DataFrame``?
============================================

How do I select specific columns from a ``DataFrame``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/03_subset_columns.svg
   :align: center

Each column in a :class:`DataFrame` is a :class:`Series`. As a single
column is selected, the returned object is a pandas :class:`Series`. We
can verify this by checking the type of the output:

.. ipython:: python

    type(titanic["Age"])

And have a look at the ``shape`` of the output:

.. ipython:: python

    titanic["Age"].shape

:attr:`DataFrame.shape` is an attribute of a pandas ``Series`` and
``DataFrame`` containing the number of rows and columns:
*(nrows, ncolumns)*. Remember from the
:ref:`tutorial on reading and writing <10min_tut_02_read_write>` that
attributes are accessed without parentheses. A pandas ``Series`` is
1-dimensional, so only the number of rows is returned.

.. note::
    The inner square brackets define a
    :ref:`Python list ` with column names, whereas
    the outer square brackets are used to select the data from a pandas
    ``DataFrame`` as seen in the previous example.

The returned data type is a pandas ``DataFrame``:

.. ipython:: python

    type(titanic[["Age", "Sex"]])

.. ipython:: python

    titanic[["Age", "Sex"]].shape

The selection returned a ``DataFrame`` with 891 rows and 2 columns. Remember, a
``DataFrame`` is 2-dimensional with both a row and column dimension.

To user guide

For basic information on indexing, see the user guide section on :ref:`indexing and selecting data `.

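The difference between single and double brackets can be checked on a small frame. The sketch below is only an illustration: the ``passengers`` frame and its three rows are made-up stand-ins, not the Titanic data used in this tutorial.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are illustrative only.
passengers = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C"],
        "Age": [22.0, 38.0, 26.0],
        "Sex": ["male", "female", "female"],
    }
)

# A single column name inside the brackets returns a 1-dimensional Series.
ages = passengers["Age"]
print(type(ages).__name__, ages.shape)

# A list of column names (the inner brackets) returns a 2-dimensional DataFrame.
age_sex = passengers[["Age", "Sex"]]
print(type(age_sex).__name__, age_sex.shape)

# A one-element list still returns a DataFrame, not a Series.
only_age = passengers[["Age"]]
print(type(only_age).__name__, only_age.shape)
```

The shape of a ``Series`` is a one-element tuple holding only the row count, while the shape of a ``DataFrame`` always holds two numbers, even when a single column was selected with a list.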

How do I filter specific rows from a ``DataFrame``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/03_subset_rows.svg
   :align: center

The condition inside the selection
brackets ``titanic["Age"] > 35`` checks for which rows the ``Age``
column has a value larger than 35:

.. ipython:: python

    titanic["Age"] > 35

The output of the conditional expression (``>``, but also ``==``,
``!=``, ``<``, ``<=``, … would work) is actually a pandas ``Series`` of
boolean values (either ``True`` or ``False``) with the same number of
rows as the original ``DataFrame``. Such a ``Series`` of boolean values
can be used to filter the ``DataFrame`` by putting it in between the
selection brackets ``[]``. Only rows for which the value is ``True``
will be selected.

We know from before that the original Titanic ``DataFrame`` consists of
891 rows. Let’s have a look at the number of rows which satisfy the
condition by checking the ``shape`` attribute of the resulting
``DataFrame`` ``above_35``:

.. ipython:: python

    above_35.shape

The above is equivalent to filtering by rows for which the class is
either 2 or 3 and combining the two statements with an ``|`` (or)
operator:

.. ipython:: python

    class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
    class_23.head()

.. note::
    When combining multiple conditional statements, each condition
    must be surrounded by parentheses ``()``. Moreover, you cannot use
    the Python keywords ``or``/``and``; use the ``or`` operator ``|``
    and the ``and`` operator ``&`` instead.

To user guide

See the dedicated section in the user guide about :ref:`boolean indexing ` or about the :ref:`isin function `.

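Boolean filtering can be tried end to end on a small frame. This is a minimal sketch under stated assumptions: the ``passengers`` frame and its values are hypothetical stand-ins for the Titanic data, and ``Series.isin`` is used as one way to express the "either 2 or 3" test in a single condition.

```python
import pandas as pd

# Hypothetical stand-in data; the values are illustrative only.
passengers = pd.DataFrame(
    {
        "Pclass": [3, 1, 2, 1, 3],
        "Age": [22.0, 38.0, 54.0, 35.0, 40.0],
    }
)

# The comparison yields a boolean Series with one value per row.
older_than_35 = passengers["Age"] > 35

# Putting the boolean Series inside the brackets keeps only the True rows.
above_35 = passengers[older_than_35]

# Combine conditions with | (or); each condition needs its own parentheses.
class_23 = passengers[(passengers["Pclass"] == 2) | (passengers["Pclass"] == 3)]

# isin() checks membership in a list of values and selects the same rows.
class_23_isin = passengers[passengers["Pclass"].isin([2, 3])]

# Combine conditions with & (and) to require both at once.
older_in_class_3 = passengers[(passengers["Age"] > 35) & (passengers["Pclass"] == 3)]

print(above_35.shape)
print(class_23.equals(class_23_isin))
print(older_in_class_3)
```

Because ``>`` is a strict comparison, the row with an age of exactly 35 is not part of ``above_35``.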
You might wonder what actually changed, as the first 5 lines are still
the same values. One way to verify is to check if the shape has changed:

.. ipython:: python

    age_no_na.shape

To user guide

For more dedicated functions on missing values, see the user guide section about :ref:`handling missing data `.

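The shape check is easier to follow on a frame where the missing values are visible. The sketch below assumes that a frame such as ``age_no_na`` is built by filtering with ``Series.notna``; the ``passengers`` frame and its values are hypothetical and not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical stand-in data; two of the four ages are missing.
passengers = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
        "Age": [22.0, None, 26.0, None],
    }
)

# notna() is True where Age has a value and False where it is missing.
has_age = passengers["Age"].notna()

# Filtering with the boolean Series drops the rows with a missing age.
age_no_na = passengers[has_age]

# Comparing shapes shows how many rows were removed by the filter.
print(passengers.shape)
print(age_no_na.shape)
```

If the first rows of a data set happen to have no missing values, ``head()`` looks identical before and after filtering, which is why comparing the shapes is the more reliable check.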

.. _10min_tut_03_subset.rows_and_columns:

How do I select specific rows and columns from a ``DataFrame``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/03_subset_columns_rows.svg
   :align: center

When using column names, row labels or a condition expression, use the
``loc`` operator in front of the selection brackets ``[]``. For both the
part before and after the comma, you can use a single label, a list of
labels, a slice of labels, a conditional expression or a colon. Using a
colon specifies you want to select all rows or columns.

When selecting specific rows and/or columns with ``loc`` or ``iloc``,
new values can be assigned to the selected data. For example, to assign
the name ``anonymous`` to the first 3 elements of the fourth column:

.. ipython:: python

    titanic.iloc[0:3, 3] = "anonymous"
    titanic.head()

To user guide

See the user guide section on :ref:`different choices for indexing ` to get more insight into the usage of ``loc`` and ``iloc``.

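The row and column parts of ``loc`` and ``iloc`` can be combined as sketched below. The ``passengers`` frame, its column order, and its values are hypothetical stand-ins chosen for this illustration, so the column positions differ from those of the Titanic data.

```python
import pandas as pd

# Hypothetical stand-in data; the values are illustrative only.
passengers = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
        "Age": [22.0, 38.0, 54.0, 40.0],
        "Pclass": [3, 1, 2, 3],
    }
)

# loc: a condition for the rows before the comma, a column label after it.
adult_names = passengers.loc[passengers["Age"] > 35, "Name"]

# loc: a colon selects all rows; a list of labels selects several columns.
name_age = passengers.loc[:, ["Name", "Age"]]

# iloc: integer positions; the end of each slice is excluded.
first_two = passengers.iloc[0:2, 0:2]

# Assignment through iloc: the first 3 rows of the column at position 0.
passengers.iloc[0:3, 0] = "anonymous"

print(adult_names.tolist())
print(first_two.shape)
print(passengers["Name"].tolist())
```

Position 0 refers to the ``Name`` column only because of the column order in this small frame; with ``iloc`` the result always depends on where a column sits, not on what it is called.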

REMEMBER

- When selecting subsets of data, square brackets ``[]`` are used.
- Inside these square brackets, you can use a single column/row label, a list
  of column/row labels, a slice of labels, a conditional expression or
  a colon.
- Use ``loc`` for label-based selection (using row/column names).
- Use ``iloc`` for position-based selection (using table positions).
- You can assign new values to a selection based on ``loc``/``iloc``.

To user guide

A full overview of indexing is provided in the user guide pages on :ref:`indexing and selecting data `.