Pandas module in Python

- November 26, 2019

Pandas is an open-source library for data manipulation in Python. It is built on top of NumPy, so Pandas needs NumPy to operate. Pandas provides an easy way to create, manipulate, and wrangle data, and it is also an elegant solution for time series data.

Why use Pandas?

Data scientists use Pandas for the following advantages:

It easily handles missing data
It uses Series for one-dimensional data and DataFrame for two-dimensional data
It provides an efficient way to slice the data
It provides a flexible way to merge, concatenate, or reshape the data
It includes powerful tools for working with time series

What is a data frame?

A data frame is a two-dimensional array with labeled axes (rows and columns). It is a standard way to store data.

Data frames are well known to statisticians and other data practitioners. A data frame holds tabular data, with rows storing the observations and columns naming the variables. For instance, price can be the name of a column and 2, 3, 4 the price values.

Below is an example of a Pandas data frame:

	ITEM	PRICE
0	A	2
1	B	3
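
As a minimal sketch, the table above can be built from a dictionary whose keys are the column names. The variable name df is only an illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
df = pd.DataFrame({"ITEM": ["A", "B"], "PRICE": [2, 3]})

print(df)
print(df.shape)  # (number of rows, number of columns)
```

The row labels 0 and 1 on the left are the default index that Pandas assigns when none is given.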

What is a Series?

A Series is a one-dimensional data structure. It can hold any data type, such as integer, float, or string. It is useful when you want to perform a computation on, or return, a one-dimensional array. A Series, by definition, cannot have multiple columns. For that case, use the data frame structure.

The main parameter of a Series is:

data: can be a list, dictionary, or scalar value

pd.Series([1., 2., 3.])

0    1.0
1    2.0
2    3.0

You can also create a Pandas Series with a missing value in the third row. Note that missing values in Pandas are displayed as "NaN". You can use NumPy to create a missing value explicitly: np.nan.
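
A short sketch of such a Series, assuming the third value is the missing one. The name s is just for illustration.

```python
import numpy as np
import pandas as pd

# The third entry is deliberately missing.
s = pd.Series([1.0, 2.0, np.nan])

print(s)
print(s.isna())  # True only where the value is missing
print(s.sum())   # missing values are skipped by default, so this is 3.0
```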

Step 3) Using the head function

Step 4) Using the tail function
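
The examples for these two steps are not shown here. As a rough sketch, assuming a hypothetical ten-row data frame df, head() returns the first rows and tail() returns the last rows:

```python
import pandas as pd

# Hypothetical data frame with ten rows, used only for illustration.
df = pd.DataFrame({"value": range(10)})

first = df.head()   # first 5 rows by default
last = df.tail(3)   # last 3 rows

print(first)
print(last)
```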

Step 5) A good way to get a first impression of the data is to use describe(). It provides the count, mean, standard deviation, minimum, maximum, and percentiles (25%, 50%, 75%) of the numeric columns.

df.describe()
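
To see what describe() returns, here is a small sketch on a hypothetical one-column data frame that uses the price values 2, 3, 4 from earlier:

```python
import pandas as pd

# Hypothetical data: a single numeric column.
df = pd.DataFrame({"price": [2, 3, 4]})

summary = df.describe()
print(summary)

# The result is itself a data frame, so single statistics can be looked up.
print(summary.loc["mean", "price"])
```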

Drop a column

You can drop columns using the drop() method of a data frame:

df.drop(columns=['A', 'C'])
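
A self-contained sketch, assuming a hypothetical data frame with columns A, B, and C. Note that drop() returns a new data frame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical data frame with three columns.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

trimmed = df.drop(columns=["A", "C"])

print(trimmed.columns.tolist())  # only B remains
print(df.columns.tolist())       # the original still has A, B, C
```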

Concatenation

You can concatenate two DataFrames in Pandas with pd.concat():

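import pandas as pd

# Two small data frames with the same columns and non-overlapping indexes
df1 = pd.DataFrame({'name': ['John', 'Smith', 'Paul'],
                    'Age': ['25', '30', '50']},
                   index=[0, 1, 2])

df2 = pd.DataFrame({'name': ['Adam', 'Smith'],
                    'Age': ['26', '11']},
                   index=[3, 4])

# Stack df2 below df1, keeping the original index labels
df_concat = pd.concat([df1, df2])

df_concat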
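
By default pd.concat() keeps the index labels of each input, which can produce repeated labels when the inputs overlap. A small sketch with two hypothetical data frames, left and right, shows how ignore_index=True renumbers the rows instead:

```python
import pandas as pd

# Hypothetical frames whose index labels overlap (both contain label 0).
left = pd.DataFrame({"name": ["John", "Paul"]}, index=[0, 1])
right = pd.DataFrame({"name": ["Adam"]}, index=[0])

kept = pd.concat([left, right])                            # labels 0, 1, 0
renumbered = pd.concat([left, right], ignore_index=True)   # labels 0, 1, 2

print(kept.index.tolist())
print(renumbered.index.tolist())
```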

Drop_duplicates

If a dataset may contain duplicate information, `drop_duplicates` is an easy way to exclude duplicate rows. You can see that `df_concat` has a duplicate observation: `Smith` appears twice in the column `name`.

df_concat.drop_duplicates('name')
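
Which of the duplicated rows survives depends on the keep argument. The sketch below rebuilds the same data in one step (an assumption made to keep the example self-contained) and compares the default with keep="last":

```python
import pandas as pd

# Same rows as df_concat, rebuilt here so the example runs on its own.
df_concat = pd.DataFrame(
    {"name": ["John", "Smith", "Paul", "Adam", "Smith"],
     "Age": ["25", "30", "50", "26", "11"]}
)

first_kept = df_concat.drop_duplicates(subset="name")              # keeps Smith, 30
last_kept = df_concat.drop_duplicates(subset="name", keep="last")  # keeps Smith, 11

print(first_kept)
print(last_kept)
```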

Sort values

You can sort the rows by the values of a column with sort_values():

df_concat.sort_values('Age')
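
One thing to watch for: in the data above, Age is stored as strings, so the sort compares text rather than numbers. A sketch with a hypothetical data frame that includes a single-digit age makes the difference visible:

```python
import pandas as pd

# Hypothetical data; ages are strings, as in the example above.
df = pd.DataFrame({"name": ["John", "Smith", "Paul", "Adam", "Eve"],
                   "Age": ["25", "30", "50", "26", "9"]})

as_text = df.sort_values("Age")  # text order: "9" sorts after "50"

# Convert to integers first to get a numeric sort.
as_numbers = df.assign(Age=df["Age"].astype(int)).sort_values("Age")

# ascending=False reverses the order.
descending = df.assign(Age=df["Age"].astype(int)).sort_values("Age", ascending=False)

print(as_text["Age"].tolist())
print(as_numbers["Age"].tolist())
```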

Rename: change of column names

You can use rename to rename a column in Pandas. In the dictionary passed to columns, each key is the current column name and each value is the new column name.

df_concat.rename(columns={"name": "Surname", "Age": "Age_ppl"})
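
Like drop(), rename() returns a new data frame rather than changing the original. A minimal sketch on a hypothetical one-row data frame:

```python
import pandas as pd

# Hypothetical one-row data frame with the same column names as above.
df = pd.DataFrame({"name": ["John"], "Age": ["25"]})

renamed = df.rename(columns={"name": "Surname", "Age": "Age_ppl"})

print(renamed.columns.tolist())  # new names
print(df.columns.tolist())       # original names are untouched
```

To keep the new names, assign the result back to a variable.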

Techy Sheky