Pandas module in Python
Pandas is an open-source library that lets you perform data manipulation in
Python. It is built on top of NumPy, so Pandas needs NumPy to operate. Pandas
provides an easy way to create, manipulate, and wrangle data, and it is also an
elegant solution for time series data.
Why use Pandas?
Data scientists use Pandas for the following advantages:
- It easily handles missing data
- It uses Series for one-dimensional data and DataFrame for two-dimensional (tabular) data
- It provides an efficient way to slice the data
- It provides a flexible way to merge, concatenate, or reshape the data
- It includes powerful tools for working with time series
What is a data frame?
A data frame is a two-dimensional array with labeled axes (rows and columns).
It is a standard way to store data.
Data frames are well known to statisticians and other data practitioners. A
data frame holds tabular data: the rows store the information and the columns
name it. For instance, price can be the name of a column and 2, 3, 4 the price
values.
Below is an example of a Pandas data frame:
  | ITEM | PRICE
0 | A    | 2
1 | B    | 3
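As a minimal sketch, the table above could be built from a dictionary that maps each column name to its list of values. The variable name `df` is only illustrative.

```python
import pandas as pd

# Each dictionary key becomes a column; the row labels 0 and 1 are generated automatically.
df = pd.DataFrame({"ITEM": ["A", "B"], "PRICE": [2, 3]})

print(df)
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the column labels
```

Printing `df` should show the same two rows, with the automatic index 0 and 1 on the left.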
What is a Series?
A Series is a one-dimensional data structure. It can hold any data type, such
as integer, float, or string. It is useful when you want to perform a
computation on, or return, a one-dimensional array. A Series, by definition,
cannot have multiple columns. For that case, use the data frame structure.
The main parameter of Series is:
data: can be a list, dictionary, or scalar value
import pandas as pd
pd.Series([1., 2., 3.])
0    1.0
1    2.0
2    3.0
dtype: float64
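The list case is shown above. As a sketch of the other two input types, a dictionary supplies its keys as index labels, while a scalar is repeated for every label passed through `index`. The labels "a", "b", "c" are made up for illustration.

```python
import pandas as pd

# From a dictionary: the keys become the index labels.
s_dict = pd.Series({"a": 1.0, "b": 2.0, "c": 3.0})

# From a scalar: the value is repeated once per label given in `index`.
s_scalar = pd.Series(5.0, index=["a", "b", "c"])

print(s_dict)
print(s_scalar)
```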
You can also create a Pandas Series with a missing value, for example in the
third row. Note that missing values in Pandas are displayed as "NaN".
You can use NumPy's np.nan to create a missing value artificially.
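A minimal sketch of such a Series, with `np.nan` in the third position. The calls to `isna`, `mean`, and `fillna` are added here only to show how the missing value behaves.

```python
import numpy as np
import pandas as pd

# The third entry is missing.
s = pd.Series([1.0, 2.0, np.nan])

print(s)
print(s.isna())   # True only where the value is missing
print(s.mean())   # NaN is skipped by default, so this averages 1.0 and 2.0

# Replace the missing value with 0.0 in a new Series; `s` itself is unchanged.
filled = s.fillna(0.0)
```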
Step 3) Use the head() function to view the first rows
Step 4) Use the tail() function to view the last rows
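The data these steps apply to is not shown here, so the sketch below assumes a hypothetical ten-row DataFrame. `head()` and `tail()` return five rows by default and accept a row count.

```python
import pandas as pd

# Hypothetical data: ten rows with a single numeric column.
df = pd.DataFrame({"value": range(10)})

first_rows = df.head()    # first 5 rows by default
last_three = df.tail(3)   # last 3 rows

print(first_rows)
print(last_three)
```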
Step 5) A good way to get an overview of the data is to use describe(). It
reports the count, mean, standard deviation, minimum, maximum, and percentiles
(25%, 50%, 75%) of each numeric column.
df.describe()
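As an illustration on a small, made-up price column, `describe()` returns a DataFrame whose rows are the statistics and whose columns are the numeric columns of the input.

```python
import pandas as pd

# Hypothetical data reusing the price values 2, 3, 4.
df = pd.DataFrame({"PRICE": [2, 3, 4]})

summary = df.describe()
print(summary)

# Individual statistics can be looked up by row label.
print(summary.loc["mean", "PRICE"])
```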
Drop a column
You can drop columns using the DataFrame's drop() method, passing the names to
the columns parameter
df.drop(columns=['A', 'C'])
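A short sketch with a hypothetical three-column DataFrame. Note that `drop` returns a new DataFrame by default and leaves the original untouched, so the result has to be assigned to a name.

```python
import pandas as pd

# Hypothetical data with columns A, B, and C.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# Returns a copy without columns A and C; `df` keeps all three.
dropped = df.drop(columns=["A", "C"])

print(list(dropped.columns))
print(list(df.columns))
```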
Concatenation
You can concatenate two DataFrames in Pandas using pd.concat()
import pandas as pd
df1 = pd.DataFrame({'name': ['John', 'Smith', 'Paul'],
                    'Age': ['25', '30', '50']},
                   index=[0, 1, 2])
df2 = pd.DataFrame({'name': ['Adam', 'Smith'],
                    'Age': ['26', '11']},
                   index=[3, 4])
# Stack df2 below df1; the index labels of each frame are preserved
df_concat = pd.concat([df1, df2])
df_concat
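The example above sets the index of each frame explicitly, so the labels do not collide. As a sketch of what happens otherwise, the two frames below use the default index, and `ignore_index=True` is one way to renumber the result. The names `stacked` and `renumbered` are illustrative.

```python
import pandas as pd

# Same data as above, but with the default index (0, 1, 2 and 0, 1).
df1 = pd.DataFrame({"name": ["John", "Smith", "Paul"], "Age": ["25", "30", "50"]})
df2 = pd.DataFrame({"name": ["Adam", "Smith"], "Age": ["26", "11"]})

# Original labels are kept, so 0 and 1 each appear twice.
stacked = pd.concat([df1, df2])

# A fresh index 0..4 replaces the original labels.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```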
Drop_duplicates
If a dataset may contain duplicate information, `drop_duplicates` is an easy
way to exclude duplicate rows. You can see that `df_concat` has a duplicate
observation: `Smith` appears twice in the column `name`. Passing `name` tells
Pandas to compare only that column, and the first occurrence is kept by default.
df_concat.drop_duplicates('name')
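A sketch of which row survives, rebuilding `df_concat` with the same data so the block runs on its own. The `keep` parameter chooses between the first and last occurrence; the names `keep_first` and `keep_last` are illustrative.

```python
import pandas as pd

# Same rows and index labels as df_concat in the text.
df_concat = pd.DataFrame(
    {"name": ["John", "Smith", "Paul", "Adam", "Smith"],
     "Age": ["25", "30", "50", "26", "11"]},
    index=[0, 1, 2, 3, 4],
)

# Default: keep the first "Smith" (index 1) and drop the later one (index 4).
keep_first = df_concat.drop_duplicates(subset="name")

# keep="last": keep the last "Smith" (index 4) and drop the earlier one (index 1).
keep_last = df_concat.drop_duplicates(subset="name", keep="last")

print(keep_first)
print(keep_last)
```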
Sort values
You can sort the rows by the values of a column with sort_values
df_concat.sort_values('Age')
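One caveat worth sketching: `Age` is stored as strings in the example, so the sort compares text. With two-digit ages the order matches the numeric one, but that does not hold in general. The frame below is hypothetical and adds a one-digit age to show the difference; converting with `astype(int)` is one possible fix.

```python
import pandas as pd

# Hypothetical data: ages stored as strings, one of them a single digit.
ages = pd.DataFrame({"name": ["John", "Ann", "Paul"], "Age": ["25", "9", "50"]})

# Text comparison: "25" < "50" < "9".
as_text = ages.sort_values("Age")

# Convert to integers in a new frame, then sort numerically.
numeric = ages.assign(Age=ages["Age"].astype(int))
as_number = numeric.sort_values("Age")

# Descending order is available through `ascending=False`.
descending = numeric.sort_values("Age", ascending=False)

print(as_text)
print(as_number)
print(descending)
```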
Rename: change of column names
You can use rename to rename a column in Pandas. In the dictionary passed to
columns, each key is the current column name and each value is the new column name.
df_concat.rename(columns={"name": "Surname", "Age": "Age_ppl"})
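A small sketch on a hypothetical one-row frame. `rename` returns a new DataFrame rather than modifying the original, and the same method can relabel rows through its `index` parameter.

```python
import pandas as pd

# Hypothetical one-row frame with the same column names as df_concat.
df = pd.DataFrame({"name": ["John"], "Age": ["25"]})

# Keys are current column names, values are the new names.
renamed = df.rename(columns={"name": "Surname", "Age": "Age_ppl"})

# The same method relabels rows when given `index`.
relabeled = df.rename(index={0: "first"})

print(list(renamed.columns))
print(list(df.columns))         # unchanged
print(relabeled.index.tolist())
```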