.. _10min_tut_10_text:
{{ header }}
.. ipython:: python

    import pandas as pd
.. raw:: html
-
.. include:: includes/titanic.rst
.. ipython:: python

    titanic = pd.read_csv("data/titanic.csv")
    titanic.head()
.. raw:: html
How to manipulate textual data
------------------------------
.. raw:: html
-
Make all name characters lowercase.
.. ipython:: python

    titanic["Name"].str.lower()
To make each of the strings in the ``Name`` column lowercase, select the ``Name`` column
(see the :ref:`tutorial on selection of data <10min_tut_03_subset>`), add the ``str`` accessor and
apply the ``lower`` method. Each of the strings is then converted element-wise.
.. raw:: html
Similar to datetime objects in the :ref:`time series tutorial <10min_tut_09_timeseries>`
having a ``dt`` accessor, a number of
specialized string methods are available when using the ``str``
accessor. These methods generally have the same names as the
equivalent built-in string methods for single elements, but are applied
element-wise (remember :ref:`element-wise calculations <10min_tut_05_columns>`?)
to each of the values of the column.
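As a small illustration of that naming correspondence, the sketch below calls ``upper`` once on a single Python string and once through the ``str`` accessor. The two names in the Series are made-up sample values, not read from the Titanic file.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration.
names = pd.Series(["Braund, Mr. Owen", "Heikkinen, Miss. Laina"])

# Built-in string method on a single element.
single = names.iloc[0].upper()

# Methods with the same names, applied element-wise through the ``str`` accessor.
upper = names.str.upper()
starts = names.str.startswith("Braund")

print(single)
print(upper.tolist())
print(starts.tolist())
```

The accessor version returns a new Series with one result per element, so it can be assigned to a column or used for further selection.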
.. raw:: html
-
Create a new column ``Surname`` that contains the surname of the passengers by extracting the part before the comma.
.. ipython:: python

    titanic["Name"].str.split(",")
Using the :meth:`Series.str.split` method, each of the values is returned as a list of
2 elements. The first element is the part before the comma and the
second element is the part after the comma.
.. ipython:: python

    titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
    titanic["Surname"]
As we are only interested in the first part representing the surname
(element 0), we can again use the ``str`` accessor and apply :meth:`Series.str.get` to
extract the relevant part. Indeed, these string methods can be
chained to apply several of them in a single expression!
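A minimal sketch of combining several string methods in one expression, assuming a small Series of made-up names in the same ``Surname, Title. First name`` layout as the ``Name`` column:

```python
import pandas as pd

# Hypothetical sample values in the same layout as the Name column.
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# Split on the comma, take element 0, then lowercase the result.
surname = names.str.split(",").str.get(0).str.lower()

# Take element 1 (the part after the comma) and strip the leading space.
rest = names.str.split(",").str.get(1).str.strip()

print(surname.tolist())
print(rest.tolist())
```

Each step returns a Series, which is why the ``str`` accessor has to be repeated before every string method.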
.. raw:: html
.. raw:: html
To user guide
More information on extracting parts of strings is available in the user guide section on :ref:`splitting and replacing strings `.
.. raw:: html
.. raw:: html
-
Extract the passenger data about the countesses on board the Titanic.
.. ipython:: python

    titanic["Name"].str.contains("Countess")

.. ipython:: python

    titanic[titanic["Name"].str.contains("Countess")]
(*Interested in her story? See* `Wikipedia `__\ *!*)
The string method :meth:`Series.str.contains` checks, for each of the values in the
column ``Name``, whether the string contains the word ``Countess`` and returns
``True`` (``Countess`` is part of the name) or
``False`` (``Countess`` is not part of the name). This output can be used
to subselect the data using conditional (boolean) indexing introduced in
the :ref:`subsetting of data tutorial <10min_tut_03_subset>`. As there was
only one countess on the Titanic, we get one row as a result.
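The same pattern can be sketched on a tiny, made-up table. The missing name in the last row is an assumption added for illustration; passing ``na=False`` makes :meth:`Series.str.contains` return ``False`` for it, so the result is a clean boolean mask.

```python
import pandas as pd

# Hypothetical table; the missing name is included only to show ``na=False``.
df = pd.DataFrame(
    {
        "Name": ["Rothes, the Countess. of", "Braund, Mr. Owen", None],
        "Age": [33, 22, 40],
    }
)

# True where the name contains "Countess", False otherwise (also for missing names).
mask = df["Name"].str.contains("Countess", na=False)

# Boolean indexing keeps only the rows where the mask is True.
countesses = df[mask]

print(mask.tolist())
print(countesses)
```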
.. raw:: html
.. note::

    More powerful extractions on strings are supported, as the
    :meth:`Series.str.contains` and :meth:`Series.str.extract` methods accept `regular
    expressions `__, but are out of
    the scope of this tutorial.
.. raw:: html
To user guide
More information on extracting parts of strings is available in the user guide section on :ref:`string matching and extracting `.
.. raw:: html
.. raw:: html
-
Which passenger of the Titanic has the longest name?
.. ipython:: python

    titanic["Name"].str.len()
To get the longest name we first have to get the length of each of the
names in the ``Name`` column. By using pandas string methods, the
:meth:`Series.str.len` function is applied to each of the names individually
(element-wise).
.. ipython:: python

    titanic["Name"].str.len().idxmax()
Next, we need to get the corresponding location, preferably the index
label, in the table for which the name length is the largest. The
:meth:`~Series.idxmax` method does exactly that. It is not a string method and is
applied to integers, so no ``str`` is used.
.. ipython:: python

    titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]
Based on the index label of the row (``307``) and the column (``Name``),
we can do a selection using the ``loc`` operator, introduced in the
:ref:`tutorial on subsetting <10min_tut_03_subset>`.
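The three steps can be sketched on a small, made-up table. The index labels ``10``, ``20`` and ``30`` are chosen on purpose to show that ``idxmax`` returns an index label rather than a row position.

```python
import pandas as pd

# Hypothetical table with non-default index labels.
df = pd.DataFrame({"Name": ["Ann", "Bartholomew", "Cleo"]}, index=[10, 20, 30])

# Step 1: length of each name (element-wise string method).
lengths = df["Name"].str.len()

# Step 2: index label of the largest length (not a string method).
label = lengths.idxmax()

# Step 3: select the name by row label and column name.
longest = df.loc[label, "Name"]

print(lengths.tolist())
print(label)
print(longest)
```

Because ``label`` is an index label, it is passed to ``loc``; a position-based selection with ``iloc`` would need the row position instead.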
.. raw:: html
.. raw:: html
-
In the "Sex" column, replace values of "male" by "M" and values of "female" by "F".
.. ipython:: python

    titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})
    titanic["Sex_short"]
Although :meth:`~Series.replace` is not a string method, it provides a convenient way
to use mappings or vocabularies to translate certain values. Here, a
``dict`` defines the mapping ``{from: to}``.
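A minimal sketch of the dictionary-based replacement on made-up values. The ``"unknown"`` entry is an assumption added to show what happens to a value that has no key in the mapping.

```python
import pandas as pd

# Hypothetical values; "unknown" is added only to show an unmapped entry.
sex = pd.Series(["male", "female", "unknown"])

# Series.replace compares whole values against the dictionary keys.
short = sex.replace({"male": "M", "female": "F"})

print(short.tolist())
```

Values without a matching key are left unchanged, and because whole values are compared, ``"female"`` is not affected by the ``"male"`` key.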
.. raw:: html
.. warning::

    There is also a :meth:`~Series.str.replace` method available to replace a
    specific set of characters. However, when having a mapping of multiple
    values, this would become:

    ::

        titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")
        titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")

    This would become cumbersome and easily lead to mistakes. Just think (or
    try out yourself) what would happen if those two statements are applied
    in the opposite order…
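The effect of the ordering can be tried out on a two-element Series of made-up values. ``regex=False`` is passed explicitly here as an assumption, to make clear that plain substrings are being replaced.

```python
import pandas as pd

# Hypothetical values mirroring the two categories of the Sex column.
sex = pd.Series(["male", "female"])

# "female" first, then "male": gives the intended result.
good = (
    sex.str.replace("female", "F", regex=False)
    .str.replace("male", "M", regex=False)
)

# "male" first: "male" is a substring of "female", so "female" is damaged
# before the second replacement can match it.
bad = (
    sex.str.replace("male", "M", regex=False)
    .str.replace("female", "F", regex=False)
)

print(good.tolist())
print(bad.tolist())
```

Since :meth:`~Series.str.replace` works on substrings, the result depends on the order of the calls, which is what makes the dictionary-based :meth:`~Series.replace` the safer choice for whole-value mappings.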
.. raw:: html
REMEMBER
- String methods are available using the ``str`` accessor.
- String methods work element-wise and can be used for conditional
  indexing.
- The ``replace`` method is a convenient method to convert values
  according to a given dictionary.
.. raw:: html
.. raw:: html
To user guide
A full overview is provided in the user guide pages on :ref:`working with text data `.
.. raw:: html