How to Use Python Pandas to Fill in Missing Data

Missing data is a common issue in data analysis. It can arise from a variety of causes, including incomplete surveys, corrupted files, or human error, and it can make statistical analysis and machine learning tasks challenging. Pandas is a Python library for data analysis that provides a number of tools for dealing with missing data. This post covers how to install Pandas and how to fill in missing data with it, step by step.

Missing data can be represented by two values:

None: None is a Python singleton object that is widely used in Python programs to signify missing data. It is a general placeholder that can be provided to variables or elements in a Pandas DataFrame or Series to indicate the lack of a value.

NaN: “Not a Number” is a special floating-point value recognized by systems that employ the standard IEEE floating-point encoding. It is a specialized representation of numerical data that is absent or undefined. NaN is frequently used in Pandas to indicate missing or undefined values in numeric columns of DataFrames or Series.
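To make the difference between the two representations concrete, here is a minimal sketch. The Series names are made up for illustration; the behavior shown is standard pandas and NumPy.

```python
import numpy as np
import pandas as pd

# None placed in a numeric Series is converted to NaN, so the dtype is float64.
numeric = pd.Series([1, None, 3])
print(numeric.dtype)

# In a Series of strings, None still counts as a missing value.
names = pd.Series(["Alice", None, "Fiona"])
print(names.isna().tolist())

# NaN never compares equal to itself, so test with pd.isna() instead of ==.
print(np.nan == np.nan)
print(pd.isna(np.nan), pd.isna(None))
```

This is why the detection functions described later are the reliable way to find missing values, rather than equality comparisons.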

How to Install Pandas

Before we begin, make sure that pandas is installed in your Python virtual environment. You can install it from your terminal with the pip package manager. Launch your terminal and run the following command:

pip install pandas

This command will download and install the pandas library, enabling you to take advantage of its extensive data analysis features. Once the installation is complete, you will be able to use pandas in Python to fill in missing data.
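One simple way to confirm the installation, assuming you run it with the same Python interpreter you installed into, is to import the library and print its version:

```python
import pandas as pd

# If the import succeeds, pandas is available in this environment.
print(pd.__version__)
```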

Useful functions in Pandas

Pandas considers None and NaN to be interchangeable for signalling missing or null values. Pandas has several useful functions for identifying, deleting, and replacing null values in a DataFrame to help with this convention:

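  • isnull(): flags missing values with True
  • notnull(): flags non-missing values with True
  • dropna(): removes rows or columns that contain missing values
  • fillna(): fills missing values with a given value
  • replace(): replaces one value with another
  • interpolate(): estimates missing values from neighboring values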
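As a quick orientation before the detailed sections, the sketch below uses three of these functions on a small DataFrame. The column names and values are invented for illustration.

```python
import pandas as pd

# A small DataFrame with one missing value in columns A and B.
df = pd.DataFrame({"A": [1, 2, None], "B": [None, 5, 6], "C": [7, 8, 9]})

# Count the missing cells in each column.
missing_per_column = df.isnull().sum()

# Keep only the rows that have no missing values.
complete_rows = df.dropna()

# Replace every missing cell with 0.
zero_filled = df.fillna(0)

print(missing_per_column)
print(complete_rows)
print(zero_filled)
```

Summing the Boolean mask from isnull() is a common first step, since it shows where the gaps are before you decide whether to drop or fill them.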

Missing values using isnull() and notnull()

The isnull() and notnull() functions in Pandas DataFrame can be used to check for missing values. These functions are not just applicable to DataFrames, but may also be used with Pandas Series to identify null values. Here’s how you can make use of them:

Checking with the isnull() function

We use the isnull() function on a Pandas DataFrame to check for null values. This function returns a DataFrame of Boolean values that are True wherever the original value is missing (None or NaN) and False otherwise.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, None], 'B': [None, 5, 6]})

# Check for missing values using isnull()
missing_values = df.isnull()
print(missing_values)


Output:

Checking with the notnull() function

The notnull() function returns a DataFrame of the same shape as the original DataFrame, with each element being True if it has a non-null value and False otherwise.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, None], 'B': [None, 5, 6]})

# Check for non-missing values using notnull()
non_missing_values = df.notnull()
print(non_missing_values)

Output:

Missing values in a Series

  • When applied directly to a Series, the isnull() function returns a boolean Series showing which values are null.
  • These functions are useful for discovering missing values in DataFrames and Series, allowing you to better examine and handle your data’s null values.

import pandas as pd

# Create a Series
s = pd.Series([1, None, 3, 4, None])

# Check for missing values using isnull()
missing_values = s.isnull()
print(missing_values)

Output:

dropna() Method

When using the dropna() function to remove null values from a DataFrame, you can choose whether to remove rows or columns. By default, dropna() removes every row that contains at least one null value. To remove columns that contain null values instead, pass axis=1.
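The row and column options described above can be sketched as follows. The DataFrame name scores is chosen for illustration, and the thresh argument is an extra dropna() option not covered elsewhere in this post.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
    "Third Score": [52, 40, 80, 98],
    "Fourth Score": [np.nan, np.nan, np.nan, 65],
})

# Default (axis=0): drop every row that has at least one null value.
rows_dropped = scores.dropna()

# axis=1: drop every column that has at least one null value.
cols_dropped = scores.dropna(axis=1)

# thresh=3: keep only rows with at least three non-null values.
mostly_complete = scores.dropna(thresh=3)

print(rows_dropped)
print(cols_dropped)
print(mostly_complete)
```

Note that dropna() returns a new DataFrame in each case, so scores itself is unchanged.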

The following examples build a sample DataFrame and then remove its incomplete rows.

Example 1: create a DataFrame that contains null values:

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)
print(df)

Output


Example 2: drop rows with at least one NaN value (null value):

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)

# using dropna() function; it returns a new DataFrame
print(df.dropna())

Output:


fillna() Method to Fill Missing Values

The pandas fillna() method fills the missing values in your dataset with a value you provide. It offers optional arguments for customizing the filling procedure. Let’s look at the approaches for filling missing data with the fillna() method and the options available:

  • Value
  • Method
  • Inplace

Sample DataFrame

# Load a Sample Dataset
import pandas as pd

df = pd.DataFrame({
    "Name": ['Alice', 'Bob', None, 'David', None, 'Fiona', 'George'],
    "Age": [25, None, 23, 35, None, 31, 28],
    "Gender": ['F', 'M', 'M', None, 'F', 'F', 'M'],
    "Years": [3, None, None, None, 7, None, 2]
})

print(df.head())

Output:

Fill Missing Data Value

  • The value parameter specifies what to insert into the missing cells. It can be a constant, a computed value such as a column mean, or any other value you choose. For example, fillna(0) replaces missing data with 0.
  • The code below uses the fillna() method to replace all missing values in one column, here the “Years” column, with 0:

# Load a Sample Dataset
import pandas as pd

df = pd.DataFrame({
    "Name": ['Alice', 'Bob', None, 'David', None, 'Fiona', 'George'],
    "Age": [25, None, 23, 35, None, 31, 28],
    "Gender": ['F', 'M', 'M', None, 'F', 'F', 'M'],
    "Years": [3, None, None, None, 7, None, 2]
})

# Replace missing values in the "Years" column with 0
df['Years'] = df['Years'].fillna(0)
print(df.head())

Output:

Fill Missing Data Method

Using the method parameter, you can fill in missing values in a given direction. method='ffill' (forward fill) replaces missing values with the previous non-missing value, whereas method='bfill' (backward fill) replaces them with the next non-missing value. Note that recent pandas releases deprecate the method parameter of fillna() in favor of the dedicated ffill() and bfill() methods.

Code to show forward filling in pandas using the .fillna() method:

# Load a Sample Dataset
import pandas as pd

df = pd.DataFrame({
    "Name": ['Alice', 'Bob', None, 'David', None, 'Fiona', 'George'],
    "Age": [25, None, 23, 35, None, 31, 28],
    "Gender": ['F', 'M', 'M', None, 'F', 'F', 'M'],
    "Years": [3, None, None, None, 7, None, 2]
})

# Forward fill missing data using .fillna()
# On recent pandas versions, use df['Years'].ffill() instead,
# because the method argument of fillna() is deprecated there.
df['Years'] = df['Years'].fillna(method='ffill')
print(df)

Output:

  • In this code, a DataFrame df is created with a “Years” column whose missing values are represented as None. To forward-fill, the .fillna() function is applied to the “Years” column with the option method='ffill'.
  • The output shows that the missing values in the “Years” column are filled with the value that came before the gap. This method is especially effective with time series data, because filling missing values with the most recently observed value can maintain the time series’ continuity.
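The dedicated ffill() and bfill() methods give the same directional fills without the method argument. Below is a minimal sketch on the “Years” values from the sample data; the limit argument shown at the end is an optional extra that caps how many consecutive gaps are filled.

```python
import pandas as pd

years = pd.Series([3, None, None, None, 7, None, 2], name="Years")

# Forward fill: carry the last observed value forward into each gap.
forward = years.ffill()

# Backward fill: pull the next observed value back into each gap.
backward = years.bfill()

# Forward fill, but fill at most one consecutive missing value per gap.
limited = years.ffill(limit=1)

print(forward.tolist())   # [3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 2.0]
print(backward.tolist())  # [3.0, 7.0, 7.0, 7.0, 7.0, 2.0, 2.0]
print(limited.tolist())
```

With limit=1, the second and third gaps after the value 3 remain missing, which can be useful when carrying a value too far forward would be misleading.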

Fill Missing Data InPlace

  • The inplace parameter is a Boolean flag that specifies whether the change is applied directly to the existing DataFrame.
  • It is set to False by default, which means that the original DataFrame remains untouched. Setting inplace=True permanently alters the DataFrame.

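# Load a Sample Dataset
import pandas as pd

df = pd.DataFrame({
    "Name": ['Alice', 'Bob', None, 'David', None, 'Fiona', 'George'],
    "Age": [25, None, 23, 35, None, 31, 28],
    "Gender": ['F', 'M', 'M', None, 'F', 'F', 'M'],
    "Years": [3, None, None, None, 7, None, 2]
})

# Fill Missing Values In Place
# Calling fillna() on the DataFrame with a column-to-value mapping
# modifies df itself, which a chained df['Name'].fillna(..., inplace=True)
# is not guaranteed to do on recent pandas versions.
df.fillna({'Name': 'Missing'}, inplace=True)
print(df.head())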

Output


By combining these optional arguments with the fillna() method, you can tailor the process of filling missing data to your needs.
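As one way to combine them, the sketch below passes a dictionary as the value argument so that each column gets its own fill value. The choice of mean for “Age”, median for “Years”, and the placeholder strings are illustrative assumptions, not recommendations for every dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "David", None, "Fiona", "George"],
    "Age": [25, None, 23, 35, None, 31, 28],
    "Gender": ["F", "M", "M", None, "F", "F", "M"],
    "Years": [3, None, None, None, 7, None, 2],
})

# One fill value per column: constants for text, statistics for numbers.
fill_values = {
    "Name": "Missing",
    "Age": df["Age"].mean(),        # mean of the observed ages
    "Gender": "Unknown",
    "Years": df["Years"].median(),  # median of the observed years
}

# inplace is left at its default (False), so df itself is not modified.
filled = df.fillna(value=fill_values)
print(filled)
```

Because mean() and median() skip missing values by default, the statistics are computed from the observed entries only.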

replace() Missing Data

The pandas replace() method is a powerful tool for replacing values within a DataFrame that is not limited to empty cells or NaN values. It enables you to replace any defined value with a value of your choosing.


Like fillna(), replace() can be used to replace NaN values in a given column with the mean, median, mode, or any other desired value. The method also accepts the inplace keyword parameter, which lets you modify the DataFrame directly.

Let’s see how the replace() method works by replacing every occurrence of 49.50 with 50 in a small DataFrame:

import pandas as pd

df = {
    "Array_1": [49.50, 70],
    "Array_2": [65.1, 49.50]
}

data = pd.DataFrame(df)
print(data.replace(49.50, 50))

Output

  • Running this code replaces each 49.50 in both columns with 50 and leaves all other values unchanged.
  • The same call works for missing data: pass NaN as the value to replace and a computed value, such as the column mean, as the replacement.
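To connect replace() back to missing data, here is a minimal sketch that swaps NaN for a computed statistic. The DataFrame extends the example above with one missing value per column, which is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Array_1": [49.50, np.nan, 70.0],
    "Array_2": [65.1, 49.50, np.nan],
})

# Replace NaN in Array_1 with the mean of its observed values.
data["Array_1"] = data["Array_1"].replace(np.nan, data["Array_1"].mean())

# Replace NaN in Array_2 with the median of its observed values.
data["Array_2"] = data["Array_2"].replace(np.nan, data["Array_2"].median())

print(data)
```

For filling NaN specifically, fillna() does the same job; replace() becomes more useful when the placeholder for missing data is some other sentinel value.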

Fill Missing Data With interpolate()

The pandas interpolate() function estimates missing values in a DataFrame from the existing values around them. By applying mathematical interpolation techniques, it can provide reasonable estimates for the missing cells.

It’s worth noting that interpolate() is meant for numeric columns, because it relies on mathematical calculations to fill in the missing values. Furthermore, setting the inplace keyword parameter to True modifies the DataFrame directly instead of returning a new one.

Run the following code to observe how the interpolate() method works:

# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# Print the dataframe
print(df)

Output


The example below uses linear interpolation, which treats the values as equally spaced and ignores the index:

# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# to interpolate the missing values; interpolate() returns a new
# DataFrame, so the result is assigned back to df
df = df.interpolate(method='linear', limit_direction='forward')

# Print the dataframe
print(df)

Output

  • The interpolate() method is applied to the DataFrame in the preceding code. The method parameter is set to 'linear', so linear interpolation is used to estimate the missing values.
  • The limit_direction argument defines the direction in which consecutive missing values are filled: 'forward', 'backward', or 'both'.
  • By running these lines of code, you may conduct interpolation on the DataFrame’s numeric columns, filling in missing values with estimated values based on the existing data.
  • By applying mathematical estimating approaches, the interpolate() method provides a strong tool for dealing with missing data, particularly in numeric columns.
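The effect of limit_direction is easiest to see on a single Series. The sketch below uses made-up values with a gap at the start and a gap in the middle.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Forward only: the interior gap is filled linearly, but the leading
# NaN stays missing because no earlier value exists to start from.
forward_only = s.interpolate(method="linear", limit_direction="forward")

# Both directions: the leading NaN is also filled, using the first
# valid value that follows it.
both = s.interpolate(method="linear", limit_direction="both")

print(forward_only.tolist())
print(both.tolist())
```

The interior values become 2.0 and 3.0 in both cases, since linear interpolation spaces them evenly between 1.0 and 4.0.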


Conclusion

In conclusion, Python Pandas offers a solid set of tools for filling in missing data. You can manage missing values in your data by using functions like fillna() and interpolate() and by choosing strategies such as forward-filling, back-filling, or linear interpolation. These strategies preserve your dataset’s structure while closing its gaps, which helps you produce more dependable analyses and insights. Please share your thoughts and feedback in the comment section below.