Python for Data Science: From Scratch

Learning about Data Structures and important packages like Numpy and Pandas in Python.

This article is the second piece in the Python for Data Science series. In case you haven’t gone through the introduction to Python (part 1), go ahead and skim through that article here. Now that you know the basics, it’s time to take on more challenging topics in Python. In this article, we shall look into Python’s methods for data representation and manipulation. The following topics will be covered:

  1. Data Structures: Tuple, List, Dictionary.
  2. Packages: Numpy, Pandas.

SPOILER ALERT: You can expect a bonus at the end of the article!🤫


PART 1. Data Structures

Python provides simple yet powerful data structures that help data practitioners represent, manipulate and store different forms of data as per their need. Let’s get comfortable with these structures one by one.

1.1 Tuple:

A tuple is an immutable and fixed-length sequence of Python objects.

#Declaring a tuple
tup = 4,5,6
print(tup)
#output : (4, 5, 6)

#Advisable to use parentheses
tup = (4,5,6)
print(tup)
#Output: (4, 5, 6)

#nested tuples
nested_tup = ((1,2,3),4,5,6)
print(nested_tup)
#output: ((1, 2, 3), 4, 5, 6)

#access the data individually by enclosing the index in square brackets
print(nested_tup[1])
#output : 4


#default index begins from 0
print(tup[0])
#output: 4
print(nested_tup[0])
#output: (1, 2, 3)

#convert any sequence using tuple()
tup = tuple([4,8,7])
print(tup)
#output: (4, 8, 7)

#String representation
tup = tuple('Ramayana')
print(tup)
#o/p: ('R', 'a', 'm', 'a', 'y', 'a', 'n', 'a')
print(tup[4])
#o/p: y
  • You can create a tuple with a simple comma-separated sequence of values. It is not mandatory, but it is recommended, to enclose the sequence within parentheses.
  • Like many other languages, Python begins indexing at 0.
  • Elements of a tuple can be accessed by stating the required index number in square brackets: ‘[<index_number>]’
  • You can also create a tuple of tuples, called a nested tuple.
  • If you have a string or another sequence that you want to convert into a tuple, simply use the tuple() function.

Tuple Manipulation: A tuple is immutable, that is, once you define a tuple’s contents, you cannot modify it.

#declare a tuple as below
tup = tuple(['foo',[1,2,3],True])
print(tup)
#output: ('foo', [1, 2, 3], True)

#Let's try to change the objects in this tuple
tup[2] = False
#output: -------------------------
#TypeError Traceback (most recent call last)
#<ipython-input-2-b89d0c4ae599> in <module>()
#----> 1 tup[2] = False
#TypeError: 'tuple' object does not support item assignment

#but a mutable object stored inside a tuple can be changed in-place
tup[1].append(7)
print(tup)
#output: ('foo', [1, 2, 3, 7], True)

#tuple concatenation
tup1 = (1,2,3)
tup2 = ('a','b','c')
tup1 + tup2
#output: (1, 2, 3, 'a', 'b', 'c')
('a','c','d') + (2,5,6,4)
#output: ('a', 'c', 'd', 2, 5, 6, 4)

#tuple multiplication
tup * 2
#output : ('foo', [1, 2, 3, 7], True, 'foo', [1, 2, 3, 7], True)
tup * 0
#output : ()
  • Though the objects stored in a tuple may be mutable themselves, the tuple in itself remains immutable.
  • If an object inside a tuple is mutable, such as a list, you can modify it in-place.
  • You can concatenate two tuples using the ‘+’ operator. This operator simply appends one tuple to the tail of the other.
  • Multiplying a tuple with an integer simply concatenates that number of copies of the tuple.

Unpacking Tuples: While unpacking, each value is assigned to the variable in the corresponding position.

  • Consider the example below: the values get assigned one by one to the series of variables.
  • Nested tuples can be unpacked similarly. If you don’t group the target variables in matching parentheses, the inner tuple of the nested tuple will be assigned whole to a single variable.
  • The *rest syntax is used when you only want to name the first few variables; the rest variable collects the remaining values. Look over the example for more clarity on how it works.
  • The last one is the count() method which, as the name clearly says, counts the occurrences of a given value.
#values are assigned to the variables in corresponding positions
tup = (4,5,6)
a,b,c = tup
print(a,b,c)
#output: 4 5 6

#nested tuple unpacking
tup = (4,5,6,(1,2))
a,b,c,(d,e) = tup
print(a,e,d)
#output: 4 2 1
 
  
#rest operator
temp = (4,5,6,7,1,2,3,7,8,9,0,11,2,3,43,54,66)
a,b,c,*rest = temp
print(a)
# o/p: 4
print(c)
#o/p: 6
print(rest)
#o/p: [7, 1, 2, 3, 7, 8, 9, 0, 11, 2, 3, 43, 54, 66]

#count method
tup = (2,2,3,5,2,4,2,4,1,0)
print(tup.count(2))
#o/p: 4
print(tup.count(7))
#o/p : 0
print(tup.count(5))
#o/p : 1

1.2 List:

A list is a variable-length and mutable data structure.

  • You can define a list using square brackets [] or using the list() function.
  • As with tuples, list indices start at zero and end at the length of the list minus one.
#define a list using square brackets
list1 = [2,3,4,'clap',None,False]
print(list1)
#o/p: [2, 3, 4, 'clap', None, False]

#accessing list elements
print(list1[0])
#o/p: 2
print(list1[5])
#o/p: False


#declare a tuple 
tup = ('Sneeze','Wipe','Breathe')
#convert the tuple to list 
tup_to_list = list(tup)

print(tup)
#o/p:('Sneeze', 'Wipe', 'Breathe')
print(type(tup))
#o/p: <class 'tuple'>
print(tup_to_list)
#o/p: ['Sneeze', 'Wipe', 'Breathe']
print(type(tup_to_list))
#o/p: <class 'list'>

Adding and removing elements:

#modifying elements in-place
a_list = ['peek','a','boo']
a_list[2] = 'bawww'
print(a_list)
#o/p: ['peek', 'a', 'bawww']

#append method
a_list.append('please')
print(a_list)
#o.p: ['peek', 'a', 'bawww', 'please']

#insert method
a_list.insert(0,'hey')
print(a_list)
#o/p: ['hey', 'peek', 'a', 'bawww', 'please']

#removing element from specific place
a_list.pop(3)
print(a_list)
#o/p: ['hey', 'peek', 'a', 'please']

#using the remove method
long_list = ['hey','hello','namaste','pranam','hi','namaskar','hola','bonjour','salaam','Marhaba','Hallo','Hoi' ]
print(long_list)
#o/p: ['hey', 'hello', 'namaste', 'pranam', 'hi', 'namaskar', 'hola', 'bonjour', 'salaam', 'Marhaba', 'Hallo', 'Hoi']
long_list.remove('hola')
print(long_list)
#o/p: ['hey', 'hello', 'namaste', 'pranam', 'hi', 'namaskar', 'bonjour', 'salaam', 'Marhaba', 'Hallo', 'Hoi']
  • Elements can be appended to the end of the list using the append function.
  • Using the insert function you can insert an element at a specified location within the list. insert tends to be more expensive than append because the interpreter has to internally shift the elements after the insertion point to make room for the new element.
  • Or if you simply want to modify an element in the list, you can change it in-place as depicted in the code.
  • Regarding removal of elements, the inverse of insert function is the pop method.
  • You simply need to specify the index of the element you want to pop and it will be removed from the list, shifting the elements to the right of the deleted element to the left.
  • Suppose the list is a long one, and you don’t know the index of the value you want to remove, you can use the remove keyword which locates the first such value and removes it from the list.
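The cost difference between append and insert at the front can be sketched with the standard timeit module (the iteration count here is arbitrary, chosen only to make the gap visible):

```python
import timeit

# appending at the end of a list is O(1) amortized
t_append = timeit.timeit('lst.append(0)', setup='lst = []', number=20000)

# inserting at index 0 is O(n): every existing element shifts right
t_insert = timeit.timeit('lst.insert(0, 0)', setup='lst = []', number=20000)

print(t_append < t_insert)  # typically True: front-insertion is much slower
```

The exact timings vary by machine, but the front-insert total grows quadratically with the number of operations while appending grows linearly.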

Combining and Concatenating a list:

#concatenating using + operator
[1,2,3] + ['a','b','c']
#o/p: [1, 2, 3, 'a', 'b', 'c']

#extend method
a = ['what','how','why']
a.extend(['where','when','who'])
print(a)
#o/p: ['what', 'how', 'why', 'where', 'when', 'who']

Similar to tuples, adding two lists together with + concatenates them. Or, if you already have a list defined, you can append multiple elements at once using the extend method. Note that extend takes an iterable of values, such as a list, rather than a single element.
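A small sketch of how extend treats its argument: any iterable works, not only a list, and a string is iterated character by character.

```python
a = ['what', 'how', 'why']

#extend accepts any iterable, not only a list
a.extend(('where', 'when'))  #a tuple works too
print(a)
#o/p: ['what', 'how', 'why', 'where', 'when']

#beware: extending with a string appends its individual characters
a.extend('who')
print(a)
#o/p: ['what', 'how', 'why', 'where', 'when', 'w', 'h', 'o']
```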

Slicing: Slicing is selecting sections of your defined list as per your choice.

  • You can extract chunks of data from your list by specifying the slice notation, which in its basic form is start_index:stop_index passed to the indexing operator [].
#slicing a list using [] operator
seq = ['E','N','J','O','Y','L','I','F','E','<3']
print(seq)
#o/p: ['E', 'N', 'J', 'O', 'Y', 'L', 'I', 'F', 'E', '<3']

#slice the list using the inclusive start_index and exclusive stop_index
#syntax: list_name[start_index:stop_index]
seq[2:5]
#o/p: ['J', 'O', 'Y']

#an empty start or end index implies beginning or end resp.
seq[:5]
#o/p: ['E', 'N', 'J', 'O', 'Y']
seq[5:]
#o/p: ['L', 'I', 'F', 'E', '<3']

#use negative indices that start at -1 from end of list
seq[-1:] #start slicing from '-1th' element until farthest end
#o/p: ['<3']
seq[:-3]  #start slicing from beginning and stop at element before '-3rd' index
#o/p: ['E', 'N', 'J', 'O', 'Y', 'L', 'I']
seq[-8:-2] #start slicing from the '-8th' element and go up to the element before the '-2nd' index
#o/p: ['J', 'O', 'Y', 'L', 'I', 'F']

#use a double colon with a step to imply "every other"
seq[::2]
#o/p: ['E', 'J', 'Y', 'I', 'E']
seq[::-3]
#o/p: ['<3', 'I', 'O', 'E']


#reverse the list in a stud way
seq[::-1]
#o/p: ['<3', 'E', 'F', 'I', 'L', 'Y', 'O', 'J', 'N', 'E']
  • The start index is inclusive in the output while the stop index is not included in the output so that the number of elements in the result is stop-start.
  • Either the start or stop index can be omitted, in which case they default to the start of the sequence and the end of the sequence respectively.
  • Negative indices slice the sequence relative to the end.
  • Use a double colon with a step to take ‘every other’ element: my_list[::2] returns every second element of my_list.
  • Use a double colon to print the reverse of a list like a python stud(see code)! 😎

Note: Practice slicing a lot; the more you practice, the better you understand the indices. With negative indexing, always remember that the last element is at index ‘-1’, counting back towards the first element.


1.3 Dict:

A dict, or Dictionary, in Python is a mutable and indexed data structure of key-value pairs (unordered before Python 3.7; insertion-ordered since). It is considered the most important built-in data structure in Python.

#create a dict using curly braces
my_dict = {}
print(type(my_dict))
print(my_dict)
#o/p: <class 'dict'>
#o/p: {}

#defining key:value pairs
my_dict = {
    'Movie name' : 'The Shawshank Redemption',
    'Genre' : 'Drama/Mystery',
    'Release Year': 1994,
    'Cast' : ['Tim Robbins','Morgan Freeman','Bob Gunton'],
    'Director' : 'Frank Darabont'    
}
print(my_dict)
#o/p: {'Movie name': 'The Shawshank Redemption', 'Genre': 'Drama/Mystery', 
      #'Release Year': 1994, 
      #'Cast': ['Tim Robbins', 'Morgan Freeman', 'Bob Gunton'], 
      #'Director': 'Frank Darabont'}

#accessing individual elements      
print(my_dict['Movie name'])
#o/p: The Shawshank Redemption

#accessing a list of values
print(my_dict['Cast'])
#o/p: ['Tim Robbins', 'Morgan Freeman', 'Bob Gunton']
print(my_dict['Cast'][0])
#o/p: Tim Robbins
print(my_dict['Cast'][2])
#o/p: Bob Gunton
  • The data is represented in key: value format; both keys and values are Python objects.
  • Each key has a value, or a list of values, assigned to it.
  • Each key: value pair is separated from the next by a comma.
  • You can create an empty dict using curly braces ‘{}’.
  • To access a particular value, simply pass the key in the square brackets of your dict.
  • If a key holds a list of values, like the ‘Cast’ key in my_dict, you can access a particular element of the list by chaining another index in square brackets after the key, e.g. my_dict[‘Cast’][0].
#define a dict
dict1 = {
    'Country':'India',
    'State':'Maharashtra',
    'District':'Mumbai',
    'City' : 'Mulund',
    'Pincode' : '400080'
}
#add a new key to the dict 
dict1['Street']='M.G. Road'
print(dict1)
#o/p: {'Country': 'India', 'State': 'Maharashtra', 'District': 'Mumbai', 'City': 'Mulund', 'Pincode': '400080', 'Street': 'M.G. Road'}

#check if an item is present in dict or no
'City' in dict1
#o/p: True

#delete using del keyword
del dict1['Street']

#pop method
ret = dict1.pop('Pincode')
print(ret)
#o/p: 400080

print(dict1)
#o/p: {'Country': 'India', 'State': 'Maharashtra', 'District': 'Mumbai', 'City': 'Mulund'}
  • To add a new key: value pair in a dictionary, simply add a new key using the square brackets and assign the given value to it.
  • To check whether a key is present in the dictionary, use the in keyword: <Type your key here> in <dictionary name>.
  • If the given key is present in the dictionary, the output is True; otherwise it is False.
  • To delete a pair from the dict, you can use the del keyword, which removes the entire key record. Applying del to the dict itself deletes the whole dictionary.
  • The second option is to use the pop method, which removes the key from the dictionary and also returns the value that was assigned to that key.
  • Below are some general methods that could be handy while dealing with a dictionary:
#declare a dictionary
dict1 = {
    'Country':'India',
    'State':'Maharashtra',
    'District':'Mumbai',
    'City' : 'Mulund',
    'Pincode' : '400080',
    'Street' : 'M.G. Road',
    'Locality': 'Vrindavan Society',
    'Landmark' : 'Royal Treat Cafe',
    'Apartment Number' : '101-A'
}

#access the keys 
dict1.keys()
#o/p: dict_keys(['Country', 'State', 'District', 'City', 'Pincode', 'Street', 
#'Locality', 'Landmark', 'Apartment Number'])

#access the values
dict1.values()
#o/p: dict_values(['India', 'Maharashtra', 'Mumbai', 'Mulund', '400080', 'M.G. Road',
#'Vrindavan Society', 'Royal Treat Cafe', '101-A'])

#display all the items
dict1.items()
#o/p: dict_items([('Country', 'India'), ('State', 'Maharashtra'), ('District', 'Mumbai'),
#('City', 'Mulund'), ('Pincode', '400080'), ('Street', 'M.G. Road'),
#('Locality', 'Vrindavan Society'), ('Landmark', 'Royal Treat Cafe'), 
#('Apartment Number', '101-A')])

#copy current dict into new dict
x = dict1.copy()
print(x)
#o/p: {'Country': 'India', 'State': 'Maharashtra', 'District': 'Mumbai', 'City': 
#'Mulund', 'Pincode': '400080', 'Street': 'M.G. Road', 'Locality': 'Vrindavan Society',
#'Landmark': 'Royal Treat Cafe', 'Apartment Number': '101-A'}

#get the value of a key
dict1.get('City')
#o/p: Mulund

#clear the dictionary
dict1.clear()
print(dict1)
#o/p: {}

With that, we have covered the basic data structures used in Python for analysis, none of which require importing a package. Next we focus on some important packages that are known to be quite helpful in analysis.


PART 2: Packages in Python

Python is an open-source project with contributions from thousands of developers, who have uploaded numerous packages that help us perform activities by simply importing and using them. The following is a study of a few of those packages.

2.1. NumPy:

Numpy, short for Numerical Python, is one of the most important packages for numerical computation in Python. Most of the packages that provide scientific functionality use NumPy’s array objects as the lingua franca for data exchange.

#standard convention of importing numpy
import numpy as np  
data = np.random.randn(3,3) #generate a random array of size 3 x 3
print(data)
#o/p: [[ 0.20158825 -0.11349001 -0.61041965]
#[ 0.16453124  1.16433185 -2.97683373]
#[ 2.2510491   0.47131988 -0.9696766 ]]


#some operations on our array
print(data * 100)
print(data + data)

#Creating arrays from lists
my_list = [3,6,2,7,6,3,0]
my_array = np.array(my_list)
print(my_list,type(my_list))
#o/p: [3, 6, 2, 7, 6, 3, 0] <class 'list'>
print(my_array,type(my_array))
#o/p: [3 6 2 7 6 3 0] <class 'numpy.ndarray'>

#nested lists become multidimensional arrays
list1 = [[1,2,3,4],[5,4,3,2]]
array1 = np.array(list1)
print(array1)
#o/p: [[1 2 3 4]
#      [5 4 3 2]]

#create an array of 5 zeros of type int
np.zeros(5,dtype = int)

#create an ndarray of size 3 x 2 with all ones
np.ones((3,2),dtype = int)
#o/p: array([[1, 1],
#       [1, 1],
#       [1, 1]])

#create a 2 x 2 array filled with 90
np.full((2,2),90)
#o/p: array([[90, 90],
#       [90, 90]])

#create an array of sequence from 0 to 30 with interval of 3
np.arange(0,30,3) #not including 30 though
#o/p: array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])

#create an array of evenly spaced data between 0 and 1
np.linspace(0,1,5)
#o/p: array([0.  , 0.25, 0.5 , 0.75, 1.  ])

#create an identity matrix array 
np.eye(3)
#o/p: array([[1., 0., 0.],
#            [0., 1., 0.],
#            [0., 0., 1.]])

In some ways, NumPy’s array objects are much like Python’s built-in list type, but NumPy is a winner when we deal with colossal data. Knowing about NumPy in depth will be eventually valuable no matter what field of data science you’re interested in!

  • One of the key features of NumPy is the ndarray, or N-dimensional array object, a fast, flexible container for large datasets in Python.
  • Arrays enable the programmer to perform mathematical operations on whole blocks of data with syntax similar to operations on scalar values.
  • Let’s first import the NumPy package and generate a small array of random numbers.
  • An ndarray is a generic multi-dimensional container for homogeneous data, that is, all the data points must be of the same type.
  • The easiest way to create an array is to use the array function. This function accepts any sequence-like object, for example a list, and produces a NumPy array containing the passed data.
  • Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array.
  • Since list1 in the code example was a list of lists, the resulting NumPy array has two dimensions, with the shape inferred correctly from the data.
  • Apart from the np.array function, there are a number of other functions for creating new arrays. Go through the examples above to get an idea of the various ways to declare an array.
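As a quick sketch of why whole-array operations matter, compare an element-wise operation done with a plain-list comprehension versus NumPy’s vectorized form (the numbers are made up for illustration):

```python
import numpy as np

my_list = [3, 6, 2, 7, 6, 3, 0]

#with a plain list you must loop (or use a comprehension)
doubled_list = [x * 2 for x in my_list]
print(doubled_list)
#o/p: [6, 12, 4, 14, 12, 6, 0]

#with an ndarray the same operation is a single vectorized expression
arr = np.array(my_list)
doubled_arr = arr * 2
print(doubled_arr)
#o/p: [ 6 12  4 14 12  6  0]
```

On small lists the difference is cosmetic, but the vectorized form runs in optimized C loops, which is where NumPy pulls ahead on large datasets.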

Numpy Array Attributes:

import numpy as np
#seed for definite reproducibility
np.random.seed(0) 

#create a 1-D array using random number
#give me 7 random integers between 5 (inclusive) and 10 (exclusive).
x1 = np.random.randint(5,10,size=7) 
print(x1)

#create a 2-D array 
#give me a 3 rows x 3 columns array 
#that has random integers between 5 (inclusive) and 10 (exclusive).
x2 = np.random.randint(5,10,size=(3,3))
print(x2)

#create a 3-D array
#give me a 3  x 4 x 4 array
x3 = np.random.randint(5,10,size=(3,4,4))
print(x3)

#using the ndim attribute
print(x1.ndim)
#o/p: 1
print(x2.ndim)
#o/p: 2
print(x3.ndim)
#o/p: 3

#print size of each dimension
print(x1.shape)
#o/p: (7,)
print(x2.shape)
#o/p: (3, 3)
print(x3.shape)
#o/p: (3, 4, 4)

#print total size
print(x1.size)
#o/p: 7
print(x2.size)
#o/p: 9
print(x3.size)
#o/p: 48

Let’s start with some basic array attributes that are used to obtain information about an array. First, we will create three random arrays: one-dimensional, two-dimensional and three-dimensional.

  • We set a seed so that the same random arrays are generated each time the code is run.
  • Use randint to generate random integers between low (inclusive) and high (exclusive). The format is numpy.random.randint(low,high,size,dtype).
  • Now it’s time to dig into some attributes of an array: the ndim attribute, used as <array_name>.ndim, gives the number of dimensions.
  • Then, using the shape attribute, we can find the size of each dimension, while the size attribute gives the total number of elements in the array.

Accessing Array elements: If you are familiar with Python’s list indexing, indexing in NumPy will feel quite easy to understand. In a 1-D array, you can access the i-th value (counting from 0) by specifying the index inside the indexing operator []. For a multi-dimensional array, you specify the exact location of an element using a comma-separated tuple of indices. Indexing and slicing are best understood when performed, so I urge you to practice and think over every example given below.

import numpy as np
arr = np.array([4,5,6,7,1,2,3,0])

#accessing single element using indices
print(arr[0],arr[1],arr[6])
#o/p: 4 5 3

#accessing single element using negative indices
print(arr[-1],arr[-5])
#o/p: 0 7

#create a 3 x 3 array of random integers
arr1 = np.random.randint(0,5,(3,3))
#print the zeroth row and second column value
print(arr1[0,2]) 
#print the second last element from last row
print(arr1[-1,-2])

###############################################
#accessing subarrays 
##############1-D subarrays####################
arr = np.array([4,5,6,7,3,2,1,0,5,6,7,7,8,4,3,3])
arr[:5]
#o/p: array([4, 5, 6, 7, 3])

arr[5:]
#o/p: array([2, 1, 0, 5, 6, 7, 7, 8, 4, 3, 3])

arr[4:7]
#o/p: array([3, 2, 1])

#every other element
arr[::2]
#o/p: array([4, 6, 3, 1, 5, 7, 8, 3])

#every other element starting at index 1
arr[1::2]
#o/p: array([5, 7, 2, 0, 6, 7, 4, 3])

#reversed
arr[::-1]
#o/p: array([3, 3, 4, 8, 7, 7, 6, 5, 0, 1, 2, 3, 7, 6, 5, 4])

#every other element reversed, starting from index 3
arr[3::-2]
#o/p: array([7, 5])

############ multi-dimensional array##############
#2-D array
arr1 = np.random.randint(0,10,(3,3))
print(arr1)
#o/p: [[4 8 1]
#      [7 9 4]
#      [1 3 3]]

#print upto 1st row and all columns
arr1[:2,:3]
#o/p: array([[4, 8, 1],
#            [7, 9, 4]])

#print everything from 1st row and 1st column  
arr1[1:,1:]
#o/p: array([[9, 4],
#            [3, 3]])

#print all rows and every other column
arr1[:3,::2]
#o/p: array([[4, 1],
#            [7, 4],
#            [1, 3]])

#print every other row with columns reversed 
arr1[::2,::-1]
#o/p:array([[1, 8, 4],
#           [3, 3, 1]])

#3-D array
#defining a 3 dimensional array of random integers between 0 to 10
arr2 = np.random.randint(0,10,(4,3,3))
#print first two dimensions
arr2[:2]
#now print all rows from these 2 dimensions
arr2[:2,:3]
#slice up the above output into just 2 columns
arr2[:2,:3,:2]
#print every other dimension and row with reversed columns
arr2[::2,::2,::-1]

2.2 Pandas:

Pandas is an open source library for Python that was particularly created for data manipulation and analysis of huge chunks of data. Pandas offers robust data structures and functions for manipulating data easily.


But wait, lists, dicts and NumPy’s ndarrays could do that too, so why Pandas? Well, there are many justified reasons. One of them is that in most cases the data person will receive data in csv file format, and Pandas is well known for reading such files easily, recognizing the column headers automatically. In short, Pandas is awesome! 🐼🤗 The two primary data structures in Pandas are the Series (one-dimensional) and the DataFrame (two-dimensional).
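A minimal sketch of the csv point: pd.read_csv infers the column headers from the first row of the file. Here io.StringIO stands in for an actual file on disk, and the column names and values are made up:

```python
import io
import pandas as pd

#a small csv with a header row; StringIO stands in for an actual file
csv_data = io.StringIO("name,age\nAsha,29\nRavi,34")

#column headers are picked up automatically from the first row
df = pd.read_csv(csv_data)
print(list(df.columns))
#o/p: ['name', 'age']
```

With a real file you would pass the path instead, e.g. pd.read_csv('data.csv').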

First, we make the Pandas library available by importing it:

import pandas as pd

2.2.1 Series:

A Series is a one-dimensional array of indexed data that can hold data of any type. The axis labels are collectively referred to as index. The basic method to generate a series is :

my_series = pd.Series(data, index=index)

Where data can be a Python dict, an ndarray or even a scalar value (say 7), and index can be a list of axis labels. Thus, we will study a few cases depending on what data is-

import pandas as pd
import numpy as np

ser = pd.Series(np.random.randint(0,10,5))
ser
#o/p:
#0    1
#1    8
#2    9
#3    2
#4    6
#dtype: int32

index=['Tomatoes','Potato','Eggs','Onions','Garlic']
ser = pd.Series(np.random.randint(0,10,5),index)
ser
#o/p: 
#Tomatoes    9
#Potato      7
#Eggs        3
#Onions      8
#Garlic      9
#dtype: int32

##converting dict to series 
my_dict = {'d': 4,'n':6,'a':7}
dict_to_series = pd.Series(my_dict)
dict_to_series
#o/p:
#d    4
#n    6
#a    7
#dtype: int64

my_dict1 = {'a':7,'b':8,'c':10}
dict_to_ser = pd.Series(my_dict1,index = ['a','e','b','c','d'])
dict_to_ser
#o/p:
#a     7.0
#e     NaN
#b     8.0
#c    10.0
#d     NaN
#dtype: float64

##From Scalar value to Series
ser1 = pd.Series(5,index=['This','is','scalar','format'])
ser1
#o/p:
#This      5
#is        5
#scalar    5
#format    5
#dtype: int64

a. From ndarray:

If you are using an ndarray as the data for a Series, the index should be of the same length as the data. If no index is defined, a default index with values [0,…, len(data)-1] will be created.

b. From a dict:

As we have learned earlier, a dict is a key: value representation of data. We can use a dict to define a Series object. If we don’t pass index values, the keys of the dict are used as the index, appearing in the same order as they were defined in the dict. On the other hand, if an index is passed to the Series definition, only the data corresponding to keys present in the dict is pulled out. The remaining index labels are recorded as NaN (Not a Number), which is the standard marker for missing data.

c. From a scalar value:

If data is a scalar value or a constant, then we definitely need to specify an index. The same constant value is replicated up to the length of the index, one copy for each index label.

The following examples showcase a number of functions and attributes that dig information out of a Series object.

#import the necessary libraries
import numpy as np
import pandas as pd

#define a Series with no index
ser = pd.Series([4,7,3,0,9,1])
ser
#o/p:
#0    4
#1    7
#2    3
#3    0
#4    9
#5    1
#dtype: int64

#the values attribute extracts the actual data stored in the Series
ser.values
#o/p: array([4, 7, 3, 0, 9, 1], dtype=int64)

#the index attribute represents the index object
ser.index
#o/p: RangeIndex(start=0, stop=6, step=1)

#you can play with the indices using arithmetic operators
ser[1*2]
#o/p: 3
ser[2+3]
#o/p: 1

#define a Series with a specific index
ser1 = pd.Series([4,2,6],index = ['A','B','C'])
ser1
#o/p:
#A    4
#B    2
#C    6
#dtype: int64

ser1.index
#o/p: Index(['A', 'B', 'C'], dtype='object')
ser1.values
#o/p: array([4, 2, 6], dtype=int64)

#manipulate a value directly by specifying the index and the new value
ser1['A'] = 10
print(ser1)
#o/p:
#A    10
#B     2
#C     6
#dtype: int64

#select multiple indices at once using a list of indices;
#the output follows the order of the indices you mention in the list
ser1[['C','A']]
#o/p:
#C     6
#A    10
#dtype: int64

#select values that are greater than 5 from the Series
ser1[ser1>5]
#o/p:
#A    10
#C     6
#dtype: int64

#multiply the contents of the two Series
ser1 * ser1
#o/p:
#A    100
#B      4
#C     36
#dtype: int64

#alter the contents of the Series with addition or subtraction
ser1 + 10
#o/p:
#A    20
#B    12
#C    16
#dtype: int64

#check if a particular index value is present using the in operator
#(remember Python is case-sensitive, so specify your index as is)
'a' in ser1
#o/p: False
'A' in ser1
#o/p: True

#Pandas uses NaN as its standard representation for missing or unknown
#values; often the data person must deal with missing values to get a
#complete set of data points for further analysis
index = ['Mumbai','Delhi','Srinagar','Kolkata','Bangalore']
ser2 = pd.Series([4,6,np.nan,3,8],index)
ser2
#o/p:
#Mumbai       4.0
#Delhi        6.0
#Srinagar     NaN
#Kolkata      3.0
#Bangalore    8.0
#dtype: float64

#detect all the null values in our Series object
pd.isnull(ser2)
#o/p:
#Mumbai       False
#Delhi        False
#Srinagar      True
#Kolkata      False
#Bangalore    False
#dtype: bool
pd.notnull(ser2)
#o/p:
#Mumbai        True
#Delhi         True
#Srinagar     False
#Kolkata       True
#Bangalore     True
#dtype: bool

#arithmetic between two Series aligns on the common indices;
#indices present in only one Series come out as NaN
index = ['Delhi','Mumbai','Goa','Bangalore']
ser3 = pd.Series([5,7,3,1],index)
ser2 + ser3
#o/p:
#Bangalore     9.0
#Delhi        11.0
#Goa           NaN
#Kolkata       NaN
#Mumbai       11.0
#Srinagar      NaN
#dtype: float64

#the name attribute lets you label a Series, and also its index
ser2.name = 'Temperature'
ser2.index.name = 'City'
ser2
#o/p:
#City
#Mumbai       4.0
#Delhi        6.0
#Srinagar     NaN
#Kolkata      3.0
#Bangalore    8.0
#Name: Temperature, dtype: float64

2.2.2. Dataframe:

A dataframe is a two-dimensional indexed data structure with columns of potentially different data types. It can be thought of as a dict of Series all sharing the same index. The dataframe is the handiest and hence the most widely used pandas structure; you can think of it as a spreadsheet or a SQL table. Like a Series, a dataframe can be constructed in multiple ways:

data = pd.DataFrame(data,index,columns)
A Dataframe

Here index and columns are optional; if omitted, the default index [0…(len-1)] is shared by all columns. In the diagram given, the index runs from 0 to 4, the columns are Book Title, Author and Finished Reading?, and everything filled within the table is the data.
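The diagram's table can be reproduced with a short sketch (the titles and authors below are placeholders, since the image itself isn't reproduced here):

```python
import pandas as pd

# Rebuild the diagram's table; titles and authors are placeholder values
data = {'Book Title': ['Book A', 'Book B', 'Book C', 'Book D', 'Book E'],
        'Author': ['Author 1', 'Author 2', 'Author 3', 'Author 4', 'Author 5'],
        'Finished Reading?': ['Yes', 'No', 'Yes', 'No', 'Yes']}

# index 0..4 would be assigned by default, but it can be passed explicitly too
df = pd.DataFrame(data, index=range(5))
print(df.shape)   # (5, 3)
```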

You can create a dataframe from almost any other data structure (proceed here to study dataframe construction). To keep our study relevant to data science, rest assured you will mostly use dataframes as containers for large .csv files, .data files or any other file format that stores data. Therefore, we start with the most basic file format you are likely to work with, comma-separated values, and how to load such files.

data = pd.read_csv("path_to_file/file_name.csv")
A CSV file loaded into a dataframe

I downloaded a dataset from Kaggle and read it using the above syntax. After loading, the dataset looks like the figure beside. Let's start with some basic handy commands that help analyze a dataframe.

import pandas as pd
import numpy as np
#define a dict of Series to obtain a Dataframe
index=['Monday','Tuesday','Wednesday','Thursday','Friday']
dict_data = {'Banana': pd.Series([1,0,2,1,0],index),
'Apple' : pd.Series([2,1,0,2,1],index),
'Eggs':pd.Series([2,1,1,2,1],index)}
#convert the data dictionary into a dataframe
df = pd.DataFrame(dict_data)
#number of rows and columns in the dataframe
df.shape
#o/p: (5, 3)
#print all column names
df.columns
#o/p: Index(['Banana', 'Apple', 'Eggs'], dtype='object')
#print all indices
df.index
#o/p: Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], dtype='object')
#print the first 3 rows
#change 3 to any n to list the first n rows
df.head(3)
#print the last 3 rows
#works just like head()
df.tail(3)
#print the count, mean, std and the five-number summary of the dataset
df.describe()
#two ways to access a specific column
df.Eggs
df['Eggs']
#access the first value of this column (by position)
df['Eggs'][0]
#access the first up to the third value
df['Eggs'][0:3]

Indexing and Selecting Data:

The primary objective in this section is to learn how to slice, dice and generally get and set subsets of pandas objects. Just like Numpy arrays, we can select and index data; the only difference is that with Series and Dataframes we can also use a list of labels or even a slice of labels.

dataframe[0]                 #select the column labelled 0 (plain brackets select columns)
dataframe[0:10]              #a slice, however, selects rows: here the zeroth to ninth
dataframe[[0,5,6,4,7]]       #select the columns with these labels, in this order
dataframe[dataframe.SOP > 3.0] #select all rows where SOP > 3.0

#import libraries
import pandas as pd
import numpy as np
#load the data into a dataframe
#insert your file's path here (a raw string keeps backslashes from being treated as escapes)
data = pd.read_csv(r"F:\Projects\datasets\graduate-admissions\Admission_Predict.csv")
#using iloc to access data subsets row-wise
data.iloc[5,]                #the row at position five
data.iloc[1:6]               #rows at positions one up to five
data.iloc[[5,10,15,20]]      #every fifth row, positions 5 to 20
data.iloc[[True,False,True]] #boolean selection; the list's length must equal the number of rows
#using iloc to access data subsets column-wise
data.iloc[:,5]               #all rows of the fifth column
data.iloc[:,2:5]             #all rows of the second to fourth columns
data.iloc[:,[2,4,5]]         #the second, fourth and fifth columns
data.iloc[:,[True,False,False,True]]        #boolean selection; the list's length must equal the number of columns
data.iloc[3:9,1:5]            #rows 3 to 8 and columns 1 to 4
data.iloc[[0,2,4,6,8,10],[1,3,5,7]]   #even rows and odd columns

We use .iloc to access the integer-based positions of data, from 0 up to length-1 of either axis, though it may also be used with a boolean array. iloc raises an IndexError if a requested position is out of bounds; out-of-bounds slices, however, are allowed, just as in regular Python slicing. It is important to use the colon operator for the row selection if we only intend to select columns.
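The bounds behaviour can be sketched on a small throwaway frame (not the admissions dataset):

```python
import pandas as pd

df = pd.DataFrame({'x': range(5)})

# Out-of-bounds slices are fine, just like plain Python slicing
tail = df.iloc[3:100]
print(len(tail))             # 2 rows (positions 3 and 4)

# A single out-of-bounds position raises IndexError
try:
    df.iloc[50]
except IndexError:
    print('IndexError raised')
```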

#use the same data source as in the iloc example
data.loc[0]               #print the row labelled 0
data[0:6]                 #print the first six rows (plain slicing is positional)
data.loc[[2,5,6,4]]       #print the rows labelled 2, 5, 6 and 4
data.loc[[True,False,True]] #boolean selection; the list's length must equal the number of rows
data.loc[:,'SOP']                            #print all rows of the SOP column
data.loc[:,'Serial No.':'Research']          #print all rows from 'Serial No.' to 'Research'
data.loc[:,['Serial No.','SOP','CGPA']]      #print Serial No., SOP and CGPA only
data.loc[:,[True,False,True]]                #boolean selection over columns
#print the given element from the SOP column
data.loc[7,'SOP']
#o/p: 3.0
#print the rows labelled 3 up to and including 8
#restricted to the columns from Serial No. to SOP
data.loc[3:8,'Serial No.':'SOP']
#print specific row and column lists
data.loc[[1,3,5,7],['Serial No.','SOP','CGPA']]
#check all rows where chance of admit is greater than 75%
data.loc[:,'Chance of Admit '] > 0.75

On the other hand, loc is used for label-based extraction of data. We have to specify the actual index labels and column names as parameters.

We can select a slice of such labels or even pass a boolean array. loc raises a KeyError when the items are not found. The row labels may happen to be the integers 0 to length-1, but note that here an integer is not a position, it is a "label", and label slices include both endpoints. For columns, we use the respective column names, say 'Column 1', 'Column 2', etc.
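The label-vs-position distinction is easiest to see on a toy frame with integer labels:

```python
import pandas as pd

# Integer labels 0..3: with loc they act as labels, not positions
df = pd.DataFrame({'x': [10, 20, 30, 40]})

# Label slices include BOTH endpoints: labels 1, 2 and 3
subset = df.loc[1:3]
print(len(subset))           # 3 rows, unlike df.iloc[1:3] which gives 2
```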

Dropping Data from Axes:

#dropping data from Series
data = pd.Series([4,5,3,6,7],index=['dog','cat','rat','sparrow','pigeon'])
#drop the rat and cat data
new_data = data.drop(['rat','cat'])
data
'''O/p:
dog        4
cat        5
rat        3
sparrow    6
pigeon     7
dtype: int64
'''
new_data
'''O/p:
dog        4
sparrow    6
pigeon     7
dtype: int64
'''
#Dropping data from axes in dataframe
data = pd.DataFrame(np.random.randint(0,10,(4,3)),
index=['Aquaman', 'Batman', 'Superman','Wonder Woman'],
columns=['Power', 'Reflex', 'Skills'])
#delete the batman row
data.drop('Batman')
#delete the power column
data.drop('Power',axis=1)
data.drop('Skills',axis='columns')
#delete data from current dataframe and reflect changes in it
data.drop('Aquaman',inplace=True)
data.drop('Power',axis=1,inplace=True)

We can drop a particular data entry from a dataframe using the drop method. drop deletes the specified data point and, without modifying the original object, returns a new object with the deletion reflected.

When using drop on a dataframe, we can drop rows by simply mentioning their labels in quotes. To wipe out a particular column instead, we have to specify an additional parameter, axis=1 or axis='columns' (axis=0, the default, refers to rows).

Whenever we use drop, the change is reflected in a new object while the original data is protected. If you do want to modify the original object, pass the parameter inplace=True. Be careful with inplace, as it drops the data from the original object permanently.
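A quick sketch of the copy-versus-inplace contrast on a tiny frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

# drop returns a new object; df itself is untouched
new_df = df.drop('x')
print('x' in df.index, 'x' in new_df.index)   # True False

# inplace=True modifies df directly (and returns None)
df.drop('b', axis=1, inplace=True)
print(list(df.columns))                        # ['a']
```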

GroupBy method:

The groupby method in pandas resembles GROUP BY in SQL. It groups the entire dataframe by one particular column. After grouping the dataframe in this manner, we can perform various operations, like taking the minimum, maximum or mean, on this collective data. Any groupby operation involves one or more of the following steps on the original data object:

  • Splitting the data based on some criteria.
  • Applying a function to each group independently.
  • Combining the results into a data structure.
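The three steps above can be sketched on a toy frame (illustrative data, not from the notebook below):

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'B', 'A', 'B'],
                   'score': [10, 20, 30, 40]})

# Split by 'team', apply mean to each group, combine into one Series
result = df.groupby('team')['score'].mean()
print(result)
# team
# A    20.0
# B    30.0
```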
import pandas as pd
import numpy as np
#create a dataframe with dummy data that has repeated categorical values to ensure grouping
df = pd.DataFrame({'Random food' : ['Chicken', 'Eggs', 'Sandwich','Potato', 'Eggs','Cheese cubes','Pizza','Oats',
                                    'Potato', 'Tomato', 'Eggs', 'Chicken','Cheese cubes','Sandwich'],
                   'Day of Week' : ['Monday', 'Tuesday', 'Monday', 'Wednesday','Sunday','Saturday','Sunday',
                             'Friday', 'Wednesday', 'Monday', 'Tuesday', 'Monday','Thursday', 'Thursday'],
                   'How many times/day?': np.random.randint(low=1,high=4,size=14),
                   'Quantity' : np.random.randint(low=1,high=5,size=14),
                   'Price in rupees' : np.random.randint(low=15,high=200,size=14),
                   'Healthy?': ['Yes','No','Yes','Yes','Yes','No','No','Yes','Yes','No','No','Yes','No','Yes']})
df

Pandas objects can be split on either of the axes. The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object, you may do the following:

grouped = data.groupby(key)
grouped = data.groupby([key1, key2])

grouped = df.groupby('Healthy?')
#iterating over the GroupBy object yields each group name along with its data
for name,group in grouped:
    print(name)
    print(group)
#get_group('groupname') returns all the data under that particular group
print(grouped.get_group('Yes'))

grouped1 = df.groupby('Day of Week')

An aggregation function returns a single aggregated value for each group. Once the GroupBy object is created, several aggregation operations can be performed on the collection of data. In the example below, the agg method computes the mean of 'Price in rupees' over the groups created by grouped1 (your numbers will differ since the data is random).

print(grouped1['Price in rupees'].agg(np.mean))
#o/p:
#Day of Week
#Friday        86.0
#Monday       102.5
#Saturday     169.0
#Sunday        83.0
#Thursday      97.0
#Tuesday       45.0
#Wednesday     72.0
#Name: Price in rupees, dtype: float64

#we can also perform a number of operations in one go
print(grouped1['Price in rupees'].agg([np.mean,np.max,np.sum]))

#size gives the number of rows in each group
print(grouped1['Quantity'].agg(np.size))
#o/p:
#Day of Week
#Friday       1
#Monday       4
#Saturday     1
#Sunday       2
#Thursday     2
#Tuesday      2
#Wednesday    2
#Name: Quantity, dtype: int32

A transformation on a group or a column returns an object that is indexed the same as the one being grouped. Look at the example for a better understanding.

#express each value as a percentage of its group's total
score = lambda x: (x / x.sum())*100
print(grouped1.transform(score))

The filter function, as the name says, filters out all the data that doesn't fit the given criteria.

#keep only the foods that appear at most once
print(df.groupby('Random food').filter(lambda x: len(x) <= 1))

Reshaping using PivotTable:

Create a spreadsheet-style pivot table as a DataFrame. Data is often stored in CSV files or databases in so-called "stacked" or "record" format, and we can use the pivot_table function to quickly summarize it. To understand the application of pivot tables, I decided to work on an actual dataset, so go ahead and download the Black Friday dataset from Kaggle here. This dataset holds more than 50,000 records of consumer purchases on a Black Friday sale. Say you want to know the total purchase made by men and women of each age group, or the mean purchase made by men and women across different city categories?

Simple, sketch a pivot table with the code below,

df = pd.read_csv('YOUR_FILEPATH_HERE/BlackFriday.csv')
#create a pivot table
table = pd.pivot_table(df,values = 'Purchase', index=['Gender'],columns='Age',aggfunc=np.sum)
#create a pivot table with nested index
table1 = pd.pivot_table(df,values = 'Purchase', index=['Gender','City_Category'],columns='Age',aggfunc=np.mean)

The syntax of pivot_table has four important parameters: values is the column or list of columns to be aggregated; index and columns can each be a single column or a list of columns; and aggfunc is a function or list of functions applied to aggregate the data. We can use mean, min, max, sum, median and many other functions on our aggregate.
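The same four parameters can be exercised on a tiny synthetic stand-in for the Black Friday columns (the values below are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Black Friday columns; values are made up
df = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F'],
                   'Age': ['0-17', '0-17', '18-25', '18-25'],
                   'Purchase': [100, 200, 300, 400]})

# One row per Gender, one column per Age group, cells hold summed Purchase
table = pd.pivot_table(df, values='Purchase', index=['Gender'],
                       columns='Age', aggfunc=np.sum)
print(table)
```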

Working with Missing data:

Many times the data we work with contains missing values, i.e. either the data was never recorded or it simply didn't exist. By "missing" we mean NA ("not available"). While NaN is the default missing-value marker, we need to be able to detect it easily across data of different types: float, int, boolean, general object. In many cases we may also encounter None, which should equally be treated as missing or not available.
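A one-liner confirms that pandas treats None and np.nan alike:

```python
import numpy as np
import pandas as pd

# Both None and np.nan are detected as missing
s = pd.Series([1.0, None, np.nan, 4.0])
print(s.isna())
# 0    False
# 1     True
# 2     True
# 3    False
# dtype: bool
```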

#Use the same Black Friday dataset
df['Product_Category_3'].isna()     #mark which rows have NaN values in the Product_Category_3 column
#that seems a long list of Trues and Falses
#we can derive nothing out of it
#let's summarize this output
df['Product_Category_3'].isna().value_counts()
#we get a count of each True and False category
#O/p:
#True     373299
#False    164278
#Name: Product_Category_3, dtype: int64
#count all values that are not equal to NaN
df['Age'].notna().value_counts()
#count all values that are not NaN
df['Product_Category_3'].notna().value_counts()
#around 3.7 lakh NaN values are present in this column
#o/p:
#False    373299
#True     164278
#Name: Product_Category_3, dtype: int64
#fill in missing data using the fillna() method
df['Product_Category_2'].fillna(0)
#OR
df['Product_Category_2'].fillna('missing')
#drop data that contains NaN
df.dropna(axis=0)    #drop all rows that have NaN values
df.dropna(axis=1)    #drop all columns that have NaN values
df['Product_Category_2'].dropna() #drop the NaN values from one specific column
#using the interpolate method to fill NaN values
s = pd.Series([1,3,5,7,9,np.nan,13])
s.interpolate()
#o/p:
#0     1.0
#1     3.0
#2     5.0
#3     7.0
#4     9.0
#5    11.0
#6    13.0
#dtype: float64
s = pd.Series([0,5,10,15,20,25,np.nan,30])
s.interpolate()
#o/p:
#0     0.0
#1     5.0
#2    10.0
#3    15.0
#4    20.0
#5    25.0
#6    27.5
#7    30.0
#dtype: float64
  • isna() checks the given data object for missing values; it returns True wherever a value is NaN and False otherwise.
  • Likewise, notna() checks for actual values: it returns True wherever real data is found and False otherwise.
  • We use the value_counts() method to count the number of instances in each case.
  • Sometimes it is necessary to replace the missing data with actual values, so we use fillna() to fill in the missing values and then continue our analysis. We can replace the NaNs with the mean, mode or median of the relevant data object.
  • When a data object has too many missing values, it is often advisable to discard it entirely. For this, we use dropna(). As the above gist shows, it can be used in three ways: axis=0 drops all rows of the dataset that have NaN values, axis=1 drops all columns that have NaN values, and calling it on a single column drops that column's NaN values without affecting adjacent data points.
  • The interpolate() method fills the missing values with meaningful interpolated data. Refer to this document for further reading.
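Filling NaNs with the column mean, as the fillna() bullet suggests, looks like this on a toy Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0])

# Replace NaN with the mean of the remaining values: (10 + 30) / 2 = 20
filled = s.fillna(s.mean())
print(filled.tolist())       # [10.0, 20.0, 30.0]
```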

Concatenate Data Objects:

Often there is a need to combine or concatenate two Series or two Dataframes. In this section, you will learn to do exactly that. I personally use this method when I have to combine training and testing samples into one single dataframe. So go ahead and practice the commands below in your own notebook.
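That train/test use case can be sketched with two hypothetical frames sharing the same columns; the keys parameter lets us pull each part back out later:

```python
import pandas as pd

# Hypothetical train/test frames with the same columns
train = pd.DataFrame({'feature': [1, 2]})
test = pd.DataFrame({'feature': [3]})

# keys adds an outer index level, so each part stays recoverable
combined = pd.concat([train, test], keys=['train', 'test'])
print(combined.loc['train'])
print(combined.loc['test'])
```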

{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Concatenate Series "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"concatenation of Series and DataFrame in Pandas is a flexible way to combine objects along a particular axis. We can also add a tier of hierarchical indexing on concatenation axis, which may be useful if the labels are same on the passed axis number."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concatenate two Series object based on axis = 0 "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0           Rose\n",
"1           Lily\n",
"0    Shoe Flower\n",
"1          Aster\n",
"dtype: object"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 = pd.Series(['Rose','Lily'])\n",
"s2 = pd.Series(['Shoe Flower','Aster'])\n",
"pd.concat([s1,s2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If ignore_index is True, then no matter what indices the two to-be concatenated objects, they will be numbered as starting from 0 to (len-1). This is useful if you are concatenating two objects who have no necessarily meaningful information in the index. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"0           Rose\n",
"1           Lily\n",
"2    Shoe Flower\n",
"3          Aster\n",
"dtype: object"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([s1,s2],ignore_index=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Construct a hierarchical index using keys parameter to distinguish between the indices. This index will be used to represent keys on the outermost level. If you provide keys, there won't be a need for ignore_index."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Tier 1  0           Rose\n",
"        1           Lily\n",
"Tier 2  0    Shoe Flower\n",
"        1          Aster\n",
"dtype: object"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([s1,s2],keys=['Tier 1','Tier 2'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can give a name to each index level using the names parameter."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Tier level  Row ID\n",
"Tier 1      0                Rose\n",
"            1                Lily\n",
"Tier 2      0         Shoe Flower\n",
"            1               Aster\n",
"dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([s1,s2],keys=['Tier 1','Tier 2'],names=['Tier level','Row ID'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Concatenate DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"df1 = pd.DataFrame([['a',97],['b',98],['c',99]],columns=['Letter','ASCII value'])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Letter</th>\n",
"      <th>ASCII value</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>a</td>\n",
"      <td>97</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>b</td>\n",
"      <td>98</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>c</td>\n",
"      <td>99</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  Letter  ASCII value\n",
"0      a           97\n",
"1      b           98\n",
"2      c           99"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.DataFrame([['x',120],['y',121],['z',122]],columns=['Letter','ASCII value'])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Letter</th>\n",
"      <th>ASCII value</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>x</td>\n",
"      <td>120</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>y</td>\n",
"      <td>121</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>z</td>\n",
"      <td>122</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  Letter  ASCII value\n",
"0      x          120\n",
"1      y          121\n",
"2      z          122"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"Concatenate the two DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Letter</th>\n",
"      <th>ASCII value</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>a</td>\n",
"      <td>97</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>b</td>\n",
"      <td>98</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>c</td>\n",
"      <td>99</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>x</td>\n",
"      <td>120</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>y</td>\n",
"      <td>121</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>z</td>\n",
"      <td>122</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  Letter  ASCII value\n",
"0      a           97\n",
"1      b           98\n",
"2      c           99\n",
"0      x          120\n",
"1      y          121\n",
"2      z          122"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1,df2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apply ignore_index so that the index runs continuously from 0 to (len-1)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Letter</th>\n",
"      <th>ASCII value</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>a</td>\n",
"      <td>97</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>b</td>\n",
"      <td>98</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>c</td>\n",
"      <td>99</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>x</td>\n",
"      <td>120</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>y</td>\n",
"      <td>121</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5</th>\n",
"      <td>z</td>\n",
"      <td>122</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  Letter  ASCII value\n",
"0      a           97\n",
"1      b           98\n",
"2      c           99\n",
"3      x          120\n",
"4      y          121\n",
"5      z          122"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1,df2],ignore_index=True)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th></th>\n",
"      <th>Letter</th>\n",
"      <th>ASCII value</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th rowspan=\"3\" valign=\"top\">Starting 3 letters</th>\n",
"      <th>0</th>\n",
"      <td>a</td>\n",
"      <td>97</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>b</td>\n",
"      <td>98</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>c</td>\n",
"      <td>99</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th rowspan=\"3\" valign=\"top\">Ending 3 letters</th>\n",
"      <th>0</th>\n",
"      <td>x</td>\n",
"      <td>120</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>y</td>\n",
"      <td>121</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>z</td>\n",
"      <td>122</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"                     Letter  ASCII value\n",
"Starting 3 letters 0      a           97\n",
"                   1      b           98\n",
"                   2      c           99\n",
"Ending 3 letters   0      x          120\n",
"                   1      y          121\n",
"                   2      z          122"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1,df2],keys=['Starting 3 letters','Ending 3 letters'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If one DataFrame doesn't have a column that is present in another, the missing values are represented as NaN."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"df3 = pd.DataFrame([['a',97,61],['b',98,62],['c',99,63]],columns=['Letter','ASCII value','HEX value'])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Letter</th>\n",
"      <th>ASCII value</th>\n",
"      <th>HEX value</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>a</td>\n",
"      <td>97</td>\n",
"      <td>61</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>b</td>\n",
"      <td>98</td>\n",
"      <td>62</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>c</td>\n",
"      <td>99</td>\n",
"      <td>63</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  Letter  ASCII value  HEX value\n",
"0      a           97         61\n",
"1      b           98         62\n",
"2      c           99         63"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>ASCII value</th>\n",
"      <th>HEX value</th>\n",
"      <th>Letter</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>97</td>\n",
"      <td>NaN</td>\n",
"      <td>a</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>98</td>\n",
"      <td>NaN</td>\n",
"      <td>b</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>99</td>\n",
"      <td>NaN</td>\n",
"      <td>c</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>97</td>\n",
"      <td>61.0</td>\n",
"      <td>a</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>98</td>\n",
"      <td>62.0</td>\n",
"      <td>b</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>99</td>\n",
"      <td>63.0</td>\n",
"      <td>c</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   ASCII value  HEX value Letter\n",
"0           97        NaN      a\n",
"1           98        NaN      b\n",
"2           99        NaN      c\n",
"0           97       61.0      a\n",
"1           98       62.0      b\n",
"2           99       63.0      c"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1,df3],sort=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter join=\"inner\" lets us keep only those columns that are common to all participating DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Letter</th>\n",
"      <th>ASCII value</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>a</td>\n",
"      <td>97</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>b</td>\n",
"      <td>98</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>c</td>\n",
"      <td>99</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>a</td>\n",
"      <td>97</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>b</td>\n",
"      <td>98</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>c</td>\n",
"      <td>99</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  Letter  ASCII value\n",
"0      a           97\n",
"1      b           98\n",
"2      c           99\n",
"0      a           97\n",
"1      b           98\n",
"2      c           99"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1,df3],join = \"inner\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"df4 = pd.DataFrame([['apple','An apple a day!'],['boy','The boy plays football']],\n",
"                  columns=['Words','Sentence'])"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Words</th>\n",
"      <th>Sentence</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>apple</td>\n",
"      <td>An apple a day!</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>boy</td>\n",
"      <td>The boy plays football</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   Words                Sentence\n",
"0  apple         An apple a day!\n",
"1    boy  The boy plays football"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to combine DataFrames column-wise, i.e. placed side by side, pass the axis=1 parameter."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Letter</th>\n",
"      <th>ASCII value</th>\n",
"      <th>Words</th>\n",
"      <th>Sentence</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>a</td>\n",
"      <td>97</td>\n",
"      <td>apple</td>\n",
"      <td>An apple a day!</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>b</td>\n",
"      <td>98</td>\n",
"      <td>boy</td>\n",
"      <td>The boy plays football</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>c</td>\n",
"      <td>99</td>\n",
"      <td>NaN</td>\n",
"      <td>NaN</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"  Letter  ASCII value  Words                Sentence\n",
"0      a           97  apple         An apple a day!\n",
"1      b           98    boy  The boy plays football\n",
"2      c           99    NaN                     NaN"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df1,df4],axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use verify_integrity=True to check that the concatenation axis contains no duplicate index values; if it does, a ValueError is raised."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>0</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>a</th>\n",
"      <td>1</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   0\n",
"a  1"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df5 = pd.DataFrame([1],index=['a'])\n",
"df5"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>0</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>a</th>\n",
"      <td>2</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   0\n",
"a  2"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df6 = pd.DataFrame([2],index=['a'])\n",
"df6"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"scrolled": true
},
"outputs": [
{
"ename": "ValueError",
"evalue": "Indexes have overlapping values: Index(['a'], dtype='object')",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-22-effc43a713f2>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mpd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mconcat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mdf5\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdf6\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mverify_integrity\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\reshape\\concat.py\u001b[0m in \u001b[0;36mconcat\u001b[1;34m(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)\u001b[0m\n\u001b[0;32m    223\u001b[0m                        \u001b[0mkeys\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mkeys\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlevels\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mlevels\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnames\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mnames\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    224\u001b[0m                        \u001b[0mverify_integrity\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mverify_integrity\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 225\u001b[1;33m                        copy=copy, sort=sort)\n\u001b[0m\u001b[0;32m    226\u001b[0m     \u001b[1;32mreturn\u001b[0m \u001b[0mop\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_result\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    227\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\reshape\\concat.py\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)\u001b[0m\n\u001b[0;32m    376\u001b[0m         \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcopy\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    377\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 378\u001b[1;33m         \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mnew_axes\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_get_new_axes\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    379\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    380\u001b[0m     \u001b[1;32mdef\u001b[0m \u001b[0mget_result\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\reshape\\concat.py\u001b[0m in \u001b[0;36m_get_new_axes\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m    456\u001b[0m                 \u001b[0mnew_axes\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0max\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    457\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 458\u001b[1;33m         \u001b[0mnew_axes\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_get_concat_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    459\u001b[0m         \u001b[1;32mreturn\u001b[0m \u001b[0mnew_axes\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    460\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\reshape\\concat.py\u001b[0m in \u001b[0;36m_get_concat_axis\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m    514\u001b[0m                                                   self.levels, self.names)\n\u001b[0;32m    515\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 516\u001b[1;33m         \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_maybe_check_integrity\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mconcat_axis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    517\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    518\u001b[0m         \u001b[1;32mreturn\u001b[0m \u001b[0mconcat_axis\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\reshape\\concat.py\u001b[0m in \u001b[0;36m_maybe_check_integrity\u001b[1;34m(self, concat_index)\u001b[0m\n\u001b[0;32m    523\u001b[0m                 \u001b[0moverlap\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mconcat_index\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mconcat_index\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mduplicated\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0munique\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    524\u001b[0m                 raise ValueError('Indexes have overlapping values: '\n\u001b[1;32m--> 525\u001b[1;33m                                  '{overlap!s}'.format(overlap=overlap))\n\u001b[0m\u001b[0;32m    526\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    527\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mValueError\u001b[0m: Indexes have overlapping values: Index(['a'], dtype='object')"
]
}
],
"source": [
"pd.concat([df5, df6], verify_integrity=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

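Before wrapping up, here is a minimal, self-contained recap of the main `pd.concat` options covered in the notebook above. The DataFrame contents mirror the notebook's examples; the variable names (`stacked`, `tiered`, `side_by_side`) are my own for illustration:

```python
import pandas as pd

df1 = pd.DataFrame([['a', 97], ['b', 98], ['c', 99]],
                   columns=['Letter', 'ASCII value'])
df2 = pd.DataFrame([['x', 120], ['y', 121], ['z', 122]],
                   columns=['Letter', 'ASCII value'])

# Stack row-wise with a fresh 0..(len-1) index
stacked = pd.concat([df1, df2], ignore_index=True)

# Add a hierarchical index distinguishing the source of each row
tiered = pd.concat([df1, df2], keys=['first', 'last'])

# Combine column-wise (side by side) instead of row-wise
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
```

Selecting `tiered.loc['first']` then recovers the rows that came from `df1`, which is the practical payoff of passing `keys`.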
A long article, wasn’t it? But I hope it was informative and easy to understand as well. One can read thousands of articles and move on; however, real understanding comes from hands-on practice. I urge you to work through the examples in the gists above thoroughly on your own system.

As promised, I have created a GitHub repository that contains practice problems with their respective solutions for each topic explained above. Clone the repository and practice, practice, practice!

Find the repository here.

Source: towardsdatascience