Pandas module in Python

Pandas is an open-source library for data manipulation in Python. It is built on top of NumPy, so Pandas needs NumPy to operate. Pandas provides an easy way to create, manipulate, and wrangle data. It is also an elegant solution for time series data.

Why use Pandas?

Data scientists use Pandas for the following advantages:
  • It handles missing data easily
  • It uses Series for one-dimensional data and DataFrame for two-dimensional data
  • It provides an efficient way to slice the data
  • It provides a flexible way to merge, concatenate, or reshape the data
  • It includes powerful tools for working with time series

What is a data frame?

A data frame is a two-dimensional array with labeled axes (rows and columns). It is a standard way to store data.
Data frames are well known to statisticians and other data practitioners. A data frame holds tabular data: each row stores one observation and each column names one piece of information. For instance, PRICE can be the name of a column and 2, 3, 4 its values.
Below is an example of a Pandas data frame:

    
       ITEM  PRICE
    0     A      2
    1     B      3
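As a minimal sketch, the small table above could be built from a dictionary, where each key becomes a column name. The variable name df is only an illustrative choice.

```python
import pandas as pd

# Each dictionary key becomes a column name;
# each list holds the values of that column.
df = pd.DataFrame({"ITEM": ["A", "B"], "PRICE": [2, 3]})

print(df)                     # rows get the default labels 0 and 1
print(df.shape)               # (number of rows, number of columns)
print(df["PRICE"].tolist())   # the values of a single column
```

Selecting one column with df["PRICE"] gives back a one-dimensional object, which is the Series structure described in the next section.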



What is a Series?

A series is a one-dimensional data structure. It can hold any data type, such as integer, float, or string. It is useful when you want to perform a computation on, or return, a one-dimensional array. A series, by definition, cannot have multiple columns; for that case, use a data frame.
The main parameter of Series is:
data: can be a list, a dictionary, or a scalar value

import pandas as pd
pd.Series([1.,2.,3.])
    0    1.0
    1    2.0
    2    3.0
    dtype: float64


You can also create a Pandas series with a missing value in the third row. Note that missing values in Pandas are written "NaN". You can use NumPy to create a missing value artificially: np.nan.
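A short sketch of such a series follows. The variable names are illustrative; isna() and mean() are standard Series methods.

```python
import numpy as np
import pandas as pd

# The third row holds a missing value created with np.nan.
s = pd.Series([1, 2, np.nan])
print(s)

missing = s.isna()              # True where the value is missing
n_missing = int(missing.sum())  # number of missing values
average = s.mean()              # missing values are skipped by default
```

Because NaN is a floating-point value, the whole series is stored as float64 even though the other entries were written as integers.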

Step 3) Use the head function to display the first rows
Step 4) Use the tail function to display the last rows
Step 5) A good way to get a feel for the data is to use describe(). It provides the count, mean, standard deviation, min, max, and percentiles of each numeric column.

df.describe()
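Steps 3 to 5 can be sketched on a small data frame. The data and column names here are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"price": [2, 3, 4, 5, 6, 7],
                   "qty": [10, 20, 30, 40, 50, 60]})

first_rows = df.head(2)   # first 2 rows (the default is 5)
last_rows = df.tail(2)    # last 2 rows (the default is 5)
summary = df.describe()   # one summary column per numeric column

print(first_rows)
print(last_rows)
print(summary)
```

The result of describe() is itself a data frame, so a single statistic can be read with, for example, summary.loc["mean", "price"].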

Drop a column

You can drop columns with the DataFrame method drop(). It returns a new DataFrame and leaves the original unchanged.
df.drop(columns=['A', 'C'])
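A self-contained sketch of the same call, assuming a small data frame with columns A, B, and C (the values are made up for illustration):

```python
import pandas as pd

# Illustrative data frame with three columns.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# drop() returns a new data frame without columns A and C.
trimmed = df.drop(columns=["A", "C"])

print(list(trimmed.columns))  # only B remains
print(list(df.columns))       # the original still has A, B, C
```

To keep the result, assign it to a name as shown here; calling drop() on its own line does not modify df.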

Concatenation

You can concatenate two DataFrames in Pandas with pd.concat():
import pandas as pd
df1 = pd.DataFrame({'name': ['John', 'Smith','Paul'],
                     'Age': ['25', '30', '50']},
                    index=[0, 1, 2])
df2 = pd.DataFrame({'name': ['Adam', 'Smith' ],
                     'Age': ['26', '11']},
                    index=[3, 4]) 
df_concat = pd.concat([df1,df2]) 
df_concat
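The example above sets the index of each data frame by hand so that the labels do not collide. As a hedged sketch, when both inputs use the default index, the ignore_index parameter of pd.concat() can renumber the rows instead. The names left and right are illustrative.

```python
import pandas as pd

# Both data frames start with the default index 0, 1, ...
left = pd.DataFrame({"name": ["John", "Smith"]})
right = pd.DataFrame({"name": ["Adam"]})

# Without ignore_index, the original row labels are kept and repeat.
stacked = pd.concat([left, right])

# With ignore_index=True, rows are renumbered from 0.
renumbered = pd.concat([left, right], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```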

Drop_duplicates

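If a dataset may contain duplicate information, `drop_duplicates` is an easy way to exclude the duplicate rows. You can see that `df_concat` has a duplicate observation: `Smith` appears twice in the column `name`. Passing `'name'` tells Pandas to compare only that column, and the first occurrence of each name is kept.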
df_concat.drop_duplicates('name')
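The sketch below shows how the choice of columns and the keep parameter change the result. The data frame people is an illustrative stand-in for the concatenated data, with ages stored as integers.

```python
import pandas as pd

# Illustrative data: "Smith" appears twice, with different ages.
people = pd.DataFrame({"name": ["John", "Smith", "Paul", "Adam", "Smith"],
                       "Age": [25, 30, 50, 26, 11]})

# Compare only the name column; keep the first Smith (Age 30).
by_name_first = people.drop_duplicates("name")

# Same comparison, but keep the last Smith (Age 11).
by_name_last = people.drop_duplicates("name", keep="last")

# Compare whole rows: no two rows are fully identical, so nothing is dropped.
full_rows = people.drop_duplicates()
```

Comparing on a single column can discard rows that differ elsewhere, so it is worth checking which columns really define a duplicate in your data.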

Sort values

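You can sort the rows by the values of a column with sort_values. Note that Age holds strings in this example, so the order is lexicographic; it matches the numeric order here only because every value has two digits.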
df_concat.sort_values('Age')
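A small sketch of the difference between sorting text and sorting numbers, using made-up data in which one age has a single digit:

```python
import pandas as pd

# Illustrative data with ages stored as strings.
people = pd.DataFrame({"name": ["John", "Smith", "Paul"],
                       "Age": ["25", "9", "50"]})

# String sort compares character by character, so "9" comes last.
as_text = people.sort_values("Age")

# Convert to integers to sort by numeric value.
people["Age"] = people["Age"].astype(int)
as_numbers = people.sort_values("Age")

# ascending=False reverses the order.
oldest_first = people.sort_values("Age", ascending=False)
```

Like drop(), sort_values returns a new data frame and leaves the original row order untouched.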

Rename: change of index

You can use rename to rename columns in Pandas. Pass a dictionary to the columns argument: each key is a current column name and each value is the new name. The index argument relabels rows in the same way.

df_concat.rename(columns={"name": "Surname", "Age": "Age_ppl"})
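A self-contained sketch of both uses of rename, on a one-row data frame invented for illustration:

```python
import pandas as pd

# Illustrative one-row data frame.
df = pd.DataFrame({"name": ["John"], "Age": [25]})

# Rename columns: keys are current names, values are new names.
renamed = df.rename(columns={"name": "Surname", "Age": "Age_ppl"})

# Rename row labels with the index argument.
relabeled = df.rename(index={0: "first"})

print(list(renamed.columns))
print(list(relabeled.index))
print(list(df.columns))  # the original is unchanged
```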
