Learning about Data Structures and important packages like Numpy and Pandas in Python.
This article is the second piece in the Python For Data Science Series. In case you haven't gone through the introduction to Python (part 1), go ahead and skim through that article here. Now that you know the basics, it's time to indulge in more challenging topics of Python. In this article, we shall look into how Python is used for data representation and manipulation. The following topics will be covered:
- Data Structures: Tuple, List, Dictionary.
- Packages: Numpy, Pandas.
SPOILER ALERT: You can expect a bonus at the end of the article!🤫
PART 1. Data Structures
Python provides simple yet powerful data structures that help data people represent, manipulate and store different forms of data as per their needs. Let's get comfortable with these structures one by one.
1.1 Tuple:
A tuple is an immutable and fixed-length sequence of Python objects.
#Declaring a tuple
tup = 4,5,6
print(tup)
#output: (4, 5, 6)

#Advisable to use parentheses
tup = (4,5,6)
print(tup)
#Output: (4, 5, 6)

#nested tuples
nested_tup = ((1,2,3),4,5,6)
print(nested_tup)
#output: ((1, 2, 3), 4, 5, 6)

#access the data individually by enclosing the index in square brackets
print(nested_tup[1])
#output: 4

#default index begins from 0
print(tup[0])
#output: 4
print(nested_tup[0])
#output: (1, 2, 3)

#convert any sequence using tuple()
tup = tuple([4,8,7])
print(tup)
#output: (4, 8, 7)

#String representation
tup = tuple('Ramayana')
print(tup)
#o/p: ('R', 'a', 'm', 'a', 'y', 'a', 'n', 'a')
print(tup[4])
#o/p: y
- You can simply create a tuple with a comma-separated sequence of values. It is not always mandatory, but it is recommended to enclose the sequence within parentheses.
- Like many other languages, the default index begins from 0.
- Elements of a tuple can be accessed by stating the required index number in square brackets: [<index_number>].
- You can also create a tuple of tuples called a nested tuple.
- If you have a string or a sequence that you want to convert into a tuple, simply use the tuple() function.
Tuple Manipulation: A tuple is immutable, that is, once you define a tuple's contents, you cannot modify them.
#declare a tuple as below
tup = tuple(['foo',[1,2,3],True])
print(tup)
#output: ('foo', [1, 2, 3], True)

#Let's try to change the objects in this tuple
tup[2] = False
#output:
#TypeError Traceback (most recent call last)
#----> 1 tup[2] = False
#TypeError: 'tuple' object does not support item assignment

#an object inside the tuple which is itself mutable can be changed
tup[1].append(7)
print(tup)
#output: ('foo', [1, 2, 3, 7], True)

#tuple concatenation
tup1 = (1,2,3)
tup2 = ('a','b','c')
tup1 + tup2
#output: (1, 2, 3, 'a', 'b', 'c')
('a','c','d') + (2,5,6,4)
#output: ('a', 'c', 'd', 2, 5, 6, 4)

#tuple multiplication
tup * 2
#output: ('foo', [1, 2, 3, 7], True, 'foo', [1, 2, 3, 7], True)
tup * 0
#output: ()
- Though the objects stored in a tuple may be mutable themselves, the tuple in itself remains immutable.
- If an object inside a tuple is mutable, such as a list, you can modify it in-place.
- You can concatenate two tuples using the '+' operator. This operator simply appends one tuple at the tail of the other.
- Multiplying a tuple with an integer simply concatenates that number of copies of the tuple.
Unpacking Tuples: While unpacking, the values on the right are assigned one-by-one to the corresponding variables on the left.
- Consider the example shown below: the values get assigned one-by-one into the series of variables.
- Nested tuples can be unpacked similarly. If you don't wrap the receiving variables in matching parentheses, an inner tuple of the nested tuple will be assigned to a single variable.
- The rest operator is used when you only care about the first few variables; the *rest variable takes care of the remaining values. Look over the example for more clarity on how it works.
- The last one is the count() method which, as the name suggests, is used to count the frequency of repetitive elements.
#values on the right are assigned one-by-one to the variables on the left
tup = (4,5,6)
a,b,c = tup
print(a,b,c)
#output: 4 5 6

#nested tuple unpacking
tup = (4,5,6,(1,2))
a,b,c,(d,e) = tup
print(a,e,d)
#output: 4 2 1

#rest operator
temp = (4,5,6,7,1,2,3,7,8,9,0,11,2,3,43,54,66)
a,b,c,*rest = temp
print(a)
#o/p: 4
print(c)
#o/p: 6
print(rest)
#o/p: [7, 1, 2, 3, 7, 8, 9, 0, 11, 2, 3, 43, 54, 66]

#count method
tup = (2,2,3,5,2,4,2,4,1,0)
print(tup.count(2))
#o/p: 4
print(tup.count(7))
#o/p: 0
print(tup.count(5))
#o/p: 1
1.2 List:
A list is a variable-length and mutable data structure.
- You can define a list using square brackets [] or using the list type function.
- As with tuples, the index in a list starts at zero and runs up to the length of the list minus one.
#define a list using square brackets
list1 = [2,3,4,'clap',None,False]
print(list1)
#o/p: [2, 3, 4, 'clap', None, False]

#accessing list elements
print(list1[0])
#o/p: 2
print(list1[5])
#o/p: False

#declare a tuple
tup = ('Sneeze','Wipe','Breathe')

#convert the tuple to a list
tup_to_list = list(tup)
print(tup)
#o/p: ('Sneeze', 'Wipe', 'Breathe')
print(type(tup))
#o/p: <class 'tuple'>
print(tup_to_list)
#o/p: ['Sneeze', 'Wipe', 'Breathe']
print(type(tup_to_list))
#o/p: <class 'list'>
Adding and removing elements:
#modifying elements in-place
a_list = ['peek','a','boo']
a_list[2] = 'bawww'
print(a_list)
#o/p: ['peek', 'a', 'bawww']

#append method
a_list.append('please')
print(a_list)
#o/p: ['peek', 'a', 'bawww', 'please']

#insert method
a_list.insert(0,'hey')
print(a_list)
#o/p: ['hey', 'peek', 'a', 'bawww', 'please']

#removing an element from a specific place
a_list.pop(3)
print(a_list)
#o/p: ['hey', 'peek', 'a', 'please']

#using the remove method
long_list = ['hey','hello','namaste','pranam','hi','namaskar','hola','bonjour','salaam','Marhaba','Hallo','Hoi']
print(long_list)
#o/p: ['hey', 'hello', 'namaste', 'pranam', 'hi', 'namaskar', 'hola', 'bonjour', 'salaam', 'Marhaba', 'Hallo', 'Hoi']
long_list.remove('hola')
print(long_list)
#o/p: ['hey', 'hello', 'namaste', 'pranam', 'hi', 'namaskar', 'bonjour', 'salaam', 'Marhaba', 'Hallo', 'Hoi']
- Elements can be appended to the end of the list using the append function.
- Using the insert function you can insert an element at a specified location within the list. insert tends to be more expensive than append, as the interpreter has to internally shift the subsequent elements to make room for the new element (see the timing sketch after this list).
- Or if you simply want to modify an element in the list, you can change it in-place as depicted in the code.
- Regarding removal of elements, the inverse of the insert function is the pop method.
- You simply need to specify the index of the element you want to pop and it will be removed from the list, shifting the elements to the right of the deleted element to the left.
- Suppose the list is a long one and you don't know the index of the value you want to remove; you can use the remove method, which locates the first such value and removes it from the list.
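To see the cost difference for yourself, here is a minimal timing sketch using the standard-library timeit module; the exact numbers will vary by machine, but inserting at the front should come out clearly slower:

import timeit

#appending always adds at the tail: amortized constant time
append_time = timeit.timeit('lst.append(0)', setup='lst = []', number=100000)
#inserting at index 0 shifts every existing element one slot to the right
insert_time = timeit.timeit('lst.insert(0, 0)', setup='lst = []', number=100000)
print('append:', append_time)
print('insert at front:', insert_time)  #typically many times slower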
Combining and Concatenating a list:
#concatenating using the + operator
[1,2,3] + ['a','b','c']
#o/p: [1, 2, 3, 'a', 'b', 'c']

#extend method
a = ['what','how','why']
a.extend(['where','when','who'])
print(a)
#o/p: ['what', 'how', 'why', 'where', 'when', 'who']
Similar to tuples, adding two lists together with + concatenates them. Alternatively, if you already have a list defined, you can append multiple elements to it using the extend method. While using extend, be sure to pass the values as an iterable, such as a list in square brackets. A short sketch of the difference follows.
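As a small sketch of the difference: + builds a brand-new list, while extend mutates the existing list in place and, in fact, accepts any iterable, not just a list:

a = ['what', 'how']
b = a + ['why']               #+ returns a new list; a is untouched
a.extend(('where', 'when'))   #extend modifies a in place; a tuple works too
print(a)   #['what', 'how', 'where', 'when']
print(b)   #['what', 'how', 'why']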
Slicing: Slicing is selecting sections of your defined list as per your choice.

- You can extract chunks of data from your list by specifying the slice notation, which in its basic form is start_index:stop_index passed to the indexing operator [].
#slicing a list using the [] operator
seq = ['E','N','J','O','Y','L','I','F','E','<3']
print(seq)
#o/p: ['E', 'N', 'J', 'O', 'Y', 'L', 'I', 'F', 'E', '<3']

#slice the list using the inclusive start_index and exclusive stop_index
#syntax: list_name[start_index:stop_index]
seq[2:5]
#o/p: ['J', 'O', 'Y']

#an empty start or stop index implies the beginning or end respectively
seq[:5]
#o/p: ['E', 'N', 'J', 'O', 'Y']
seq[5:]
#o/p: ['L', 'I', 'F', 'E', '<3']

#use negative indices, which start at -1 from the end of the list
seq[-1:]   #start slicing from the '-1st' element until the farthest end
#o/p: ['<3']
seq[:-3]   #start slicing from the beginning and stop at the element before the '-3rd' index
#o/p: ['E', 'N', 'J', 'O', 'Y', 'L', 'I']
seq[-8:-2] #start slicing from the '-8th' element and go up to the element before the '-2nd'
#o/p: ['J', 'O', 'Y', 'L', 'I', 'F']

#use a double colon to imply "every other"
seq[::2]
#o/p: ['E', 'J', 'Y', 'I', 'E']
seq[::-3]
#o/p: ['<3', 'I', 'O', 'E']

#reverse the list in a stud way
seq[::-1]
#o/p: ['<3', 'E', 'F', 'I', 'L', 'Y', 'O', 'J', 'N', 'E']
- The start index is inclusive while the stop index is excluded from the output, so that the number of elements in the result is stop - start.
- Either the start or stop index can be omitted, in which case they default to the start of the sequence and the end of the sequence respectively.
- Negative indices slice the sequence relative to the end.
- Use a double colon to select 'every other' element: my_list[::2] picks every second element of my_list.
- Use a double colon to print the reverse of a list like a Python stud (see code)! 😎
Note: Practice slicing really hard; the more you practice, the better you understand the indices. With negative indexing, always remember that the last element is at index -1, counting back towards the first element.
1.3: Dict:
A dict, or Dictionary, in Python is a mutable and indexed data structure (historically unordered, though since Python 3.7 it preserves insertion order). It is considered the most important and versatile data structure in Python.
#create a dict using curly braces my_dict = {} print(type(my_dict)) print(my_dict) #o/p: <class 'dict'> #o/p: {} #defining key:value pairs my_dict = { 'Movie name' : 'The Shawshank Redemption', 'Genre' : 'Drama/Mystery', 'Release Year': 1994, 'Cast' : ['Tim Robbins','Morgan Freeman','Bob Gunton'], 'Director' : 'Frank Darabont' } print(my_dict) #o/p: {'Movie name': 'The Shawshank Redemption', 'Genre': 'Drama/Mystery', #'Release Year': 1994, #'Cast': ['Tim Robbins', 'Morgan Freeman', 'Bob Gunton'], #'Director': 'Frank Darabont'} #accessing individual elements print(my_dict['Movie name']) #o/p: The Shawshank Redemption #accessing a list of values print(my_dict['Cast']) #o/p: ['Tim Robbins', 'Morgan Freeman', 'Bob Gunton'] print(my_dict['Cast'][0]) #o/p: Tim Robbins print(my_dict['Cast'][2]) #o/p: Bob Gunton
- The data is represented in key: value format in a dictionary. Here both key and value are Python objects.
- Each key has a value or a list of values assigned to it.
- Each key, value pair should be separated using a comma.
- You can create an empty dict using curly braces ‘{}’.
- If you want to access a particular value, simply pass its key in the square brackets of your dict.
- If a key holds a list of values, like the 'Cast' key in my_dict, and you want to access a particular element of that list, simply chain an index in square brackets after the key lookup, e.g. my_dict['Cast'][0].
#define a dict dict1 = { 'Country':'India', 'State':'Maharashtra', 'District':'Mumbai', 'City' : 'Mulund', 'Pincode' : '400080' } #add a new key to the dict dict1['Street']='M.G. Road' print(dict1) #o/p: {'Country': 'India', 'State': 'Maharashtra', 'District': 'Mumbai', 'City': 'Mulund', 'Pincode': '400080', 'Street': 'M.G. Road'} #check if an item is present in dict or no 'City' in dict1 #o/p: True #delete using del keyword del dict1['Street'] #pop method ret = dict1.pop('Pincode') print(ret) #o/p: 400080 print(dict1) #o/p: {'Country': 'India', 'State': 'Maharashtra', 'District': 'Mumbai', 'City': 'Mulund'}
- To add a new key: value pair in a dictionary, simply add a new key using the square brackets and assign the given value to it.
- To check whether a key is present in the dictionary, use the in keyword: <your key here> in <dictionary name>.
- If the given key is present in the dictionary, the output is True otherwise it is False.
- To delete a pair from the dict, you can use the del keyword. This keyword deletes the entire key record. If you don't specify a key, the entire dict will be deleted.
- The second option is to use the pop method, which removes the key from the dictionary and also returns the value that was assigned to that key.
- Below are some general methods that come in handy while dealing with a dictionary:
#declare a dictionary
dict1 = {
    'Country': 'India',
    'State': 'Maharashtra',
    'District': 'Mumbai',
    'City': 'Mulund',
    'Pincode': '400080',
    'Street': 'M.G. Road',
    'Locality': 'Vrindavan Society',
    'Landmark': 'Royal Treat Cafe',
    'Apartment Number': '101-A'
}

#access the keys
dict1.keys()
#o/p: dict_keys(['Country', 'State', 'District', 'City', 'Pincode', 'Street',
#'Locality', 'Landmark', 'Apartment Number'])

#access the values
dict1.values()
#o/p: dict_values(['India', 'Maharashtra', 'Mumbai', 'Mulund', '400080', 'M.G. Road',
#'Vrindavan Society', 'Royal Treat Cafe', '101-A'])

#display all the items
dict1.items()
#o/p: dict_items([('Country', 'India'), ('State', 'Maharashtra'), ('District', 'Mumbai'),
#('City', 'Mulund'), ('Pincode', '400080'), ('Street', 'M.G. Road'),
#('Locality', 'Vrindavan Society'), ('Landmark', 'Royal Treat Cafe'),
#('Apartment Number', '101-A')])

#copy the current dict into a new dict
x = dict1.copy()
print(x)
#o/p: {'Country': 'India', 'State': 'Maharashtra', 'District': 'Mumbai', 'City':
#'Mulund', 'Pincode': '400080', 'Street': 'M.G. Road', 'Locality': 'Vrindavan Society',
#'Landmark': 'Royal Treat Cafe', 'Apartment Number': '101-A'}

#get the value of a key
dict1.get('City')
#o/p: Mulund

#clear the dictionary
dict1.clear()
print(dict1)
#o/p: {}
With that, we have covered the basic data structures that Python offers for analysis without importing any package. Next, we focus on some important packages that are known to be quite helpful in analysis.
PART 2: Packages in Python
Python is an open-source project with contributions from thousands of developers, who have uploaded numerous packages that help us perform activities by simply importing and using them. The following is a study of a few of those packages.
2.1. NumPy:
Numpy, short for Numerical Python, is one of the most important packages for numerical computation in Python. Most of the packages that provide scientific functionality use NumPy’s array objects as the lingua franca for data exchange.
#standard convention for importing numpy
import numpy as np

#generate a random array of size 3 x 3
data = np.random.randn(3,3)
print(data)
#o/p: [[ 0.20158825 -0.11349001 -0.61041965]
# [ 0.16453124  1.16433185 -2.97683373]
# [ 2.2510491   0.47131988 -0.9696766 ]]

#some operations on our array
print(data * 100)
print(data + data)

#creating arrays from lists
my_list = [3,6,2,7,6,3,0]
my_array = np.array(my_list)
print(my_list,type(my_list))
#o/p: [3, 6, 2, 7, 6, 3, 0] <class 'list'>
print(my_array,type(my_array))
#o/p: [3 6 2 7 6 3 0] <class 'numpy.ndarray'>

#nested lists become multidimensional arrays
list1 = [[1,2,3,4],[5,4,3,2]]
array1 = np.array(list1)
print(array1)
#o/p: [[1 2 3 4]
# [5 4 3 2]]

#create an array of 5 zeros of type int
np.zeros(5,dtype = int)

#create an ndarray of size 3 x 2 with all ones
np.ones((3,2),dtype = int)
#o/p: array([[1, 1],
#       [1, 1],
#       [1, 1]])

#create a 2 x 2 array filled with 90
np.full((2,2),90)
#o/p: array([[90, 90],
#       [90, 90]])

#create an array of the sequence from 0 to 30 with an interval of 3
np.arange(0,30,3)   #not including 30 though
#o/p: array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])

#create an array of evenly spaced data between 0 and 1
np.linspace(0,1,5)
#o/p: array([0.  , 0.25, 0.5 , 0.75, 1.  ])

#create an identity matrix array
np.eye(3)
#o/p: array([[1., 0., 0.],
#       [0., 1., 0.],
#       [0., 0., 1.]])
In some ways, NumPy’s array objects are much like Python’s built-in list type, but NumPy is a winner when we deal with colossal data. Knowing about NumPy in depth will be eventually valuable no matter what field of data science you’re interested in!
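As a rough sketch of that speed difference, compare a vectorized NumPy multiplication against the equivalent pure-Python loop; the timings below are machine-dependent, but the ndarray version is typically an order of magnitude (or more) faster:

import time
import numpy as np

my_arr = np.arange(1000000)
my_list = list(range(1000000))

start = time.perf_counter()
arr2 = my_arr * 2                   #one vectorized operation over a million elements
print('ndarray:', time.perf_counter() - start)

start = time.perf_counter()
list2 = [x * 2 for x in my_list]    #element-by-element Python loop
print('list:   ', time.perf_counter() - start)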
- One of the key features of Numpy is the ndarray or N-dimensional array object which is a fast, flexible container for large datasets in Python.
- Arrays enable the programmer to perform mathematical operations on whole chunks of data with syntax similar to the equivalent operations on scalar values.
- Let’s first import Numpy package and generate a small array of random numbers.
- An ndarray is a generic multi-dimensional container for homogeneous data; that is, all the data points must be of the same type (see the dtype sketch after this list).
- The easiest way to create an array is to use the array function. This function accepts any sequence-like object, e.g. a list, and produces a NumPy array containing the passed data.
- Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array.
- Since the list in the code example was a list of lists, the NumPy array has two dimensions, with the shape inferred correctly from the data.
- Apart from the np.array function, there are a number of other functions for creating new arrays. Please go through the examples given in Numpy_intro.py above to get an idea of the various ways to declare an array.
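Here is the dtype sketch promised above: when the input mixes types, NumPy upcasts everything to one common dtype, and you can also request a dtype explicitly:

import numpy as np

arr = np.array([1, 2, 3.5])
print(arr.dtype)   #float64 -- the ints were upcast to floats

arr_int = np.array([1.7, 2.9], dtype=np.int32)
print(arr_int)     #[1 2] -- fractional parts are truncated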
Numpy Array Attributes:
import numpy as np

#seed for definite reproducibility
np.random.seed(0)

#create a 1-D array of random numbers:
#give me 7 random integers between 5 and 10
x1 = np.random.randint(5,10,size=7)
print(x1)

#create a 2-D array:
#give me a 3 rows x 3 columns array
#of random integers between 5 and 10
x2 = np.random.randint(5,10,size=(3,3))
print(x2)

#create a 3-D array:
#give me a 3 x 4 x 4 array
x3 = np.random.randint(5,10,size=(3,4,4))
print(x3)

#using the ndim attribute
print(x1.ndim)
#o/p: 1
print(x2.ndim)
#o/p: 2
print(x3.ndim)
#o/p: 3

#print the size of each dimension
print(x1.shape)
#o/p: (7,)
print(x2.shape)
#o/p: (3, 3)
print(x3.shape)
#o/p: (3, 4, 4)

#print the total size
print(x1.size)
#o/p: 7
print(x2.size)
#o/p: 9
print(x3.size)
#o/p: 48
Let's start with some basic array attributes that are used to obtain information about an array. First, we will create three random arrays: one-dimensional, two-dimensional and three-dimensional.
- We set seed so that the same random arrays are generated each time the code is run.
- Use randint to generate random integers from low (inclusive) to high (exclusive). The format is numpy.random.randint(low, high, size, dtype).
- Now it's time to dig into some attributes of an array. The ndim attribute, used as <array_name>.ndim, produces the number of dimensions.
- Then using the shape attribute we can find the size of each dimension and the size attribute determines the size of the entire array.
Accessing Array elements: If you are familiar with Python's list indexing, indexing in NumPy will feel quite easy to understand. In a 1-D array, you can access the i-th value, counting from 0, by specifying the index in the indexing operator, i.e. []. For a multi-dimensional array, you are expected to specify the detailed location of an element using a comma-separated tuple of indices. Indexing and slicing are best understood when performed; I urge you to practice and think over every example given in the below gist.
import numpy as np

arr = np.array([4,5,6,7,1,2,3,0])

#accessing single elements using indices
print(arr[0],arr[1],arr[6])
#o/p: 4 5 3

#accessing single elements using negative indices
print(arr[-1],arr[-5])
#o/p: 0 7

#create a 3 x 3 array of random integers
arr1 = np.random.randint(0,5,(3,3))

#print the value at the zeroth row, second column
print(arr1[0,2])

#print the second-last element from the last row
print(arr1[-1,-2])

###############################################
#accessing subarrays
##############1-D subarrays####################
arr = [4,5,6,7,3,2,1,0,5,6,7,7,8,4,3,3]
arr[:5]
#o/p: [4, 5, 6, 7, 3]
arr[5:]
#o/p: [2, 1, 0, 5, 6, 7, 7, 8, 4, 3, 3]
arr[4:7]
#o/p: [3, 2, 1]

#every other element
arr[::2]
#o/p: [4, 6, 3, 1, 5, 7, 8, 3]

#every other element starting at index 1
arr[1::2]
#o/p: [5, 7, 2, 0, 6, 7, 4, 3]

#reversed
arr[::-1]
#o/p: [3, 3, 4, 8, 7, 7, 6, 5, 0, 1, 2, 3, 7, 6, 5, 4]

#reversed every other element starting from index 3
arr[3::-2]
#o/p: [7, 5]

############multi-dimensional arrays##############
#2-D array
arr1 = np.random.randint(0,10,(3,3))
print(arr1)
#o/p: [[4 8 1]
# [7 9 4]
# [1 3 3]]

#print up to the 1st row and all columns
arr1[:2,:3]
#o/p: array([[4, 8, 1],
#       [7, 9, 4]])

#print everything from the 1st row and 1st column onwards
arr1[1:,1:]
#o/p: array([[9, 4],
#       [3, 3]])

#print all rows and every other column
arr1[:3,::2]
#o/p: array([[4, 1],
#       [7, 4],
#       [1, 3]])

#print every other row with the columns reversed
arr1[::2,::-1]
#o/p: array([[1, 8, 4],
#       [3, 3, 1]])

#3-D array
#defining a 3-dimensional array of random integers between 0 and 10
arr2 = np.random.randint(0,10,(4,3,3))

#print the first two dimensions
arr2[:2]
#now print all rows from these 2 dimensions
arr2[:2,:3]
#slice the above output down to just 2 columns
arr2[:2,:3,:2]
#print every other dimension and row with reversed columns
arr2[::2,::2,::-1]
2.2 Pandas:
Pandas is an open-source library for Python created particularly for the manipulation and analysis of huge chunks of data. Pandas offers robust data structures and functions for manipulating data easily.
But wait, lists, dicts and NumPy's ndarrays could do that too, so why Pandas? Well, there are many justified reasons. One of them is that in most cases the data person will receive data in CSV file format, and Pandas is quite famous for automatically recognizing column headers from these CSV files. In short, Pandas is awesome! 🐼🤗 The two primary data structures in Pandas are Series (one-dimensional) and DataFrame (two-dimensional).
First, we make the Pandas library available with the standard import:
import pandas as pd
2.2.1 Series:
A Series is a one-dimensional array of indexed data that can hold data of any type. The axis labels are collectively referred to as the index. The basic method to generate a Series is:
my_series = pd.Series(data, index=index)
where data can be a Python dict, an ndarray or even a scalar value (say 7), and index can be a list of axis labels. Thus, we will study a few cases depending on what data is:
import pandas as pd
import numpy as np

ser = pd.Series(np.random.randint(0,10,5))
ser
#o/p:
#0    1
#1    8
#2    9
#3    2
#4    6
#dtype: int32

index = ['Tomatoes','Potato','Eggs','Onions','Garlic']
ser = pd.Series(np.random.randint(0,10,5),index)
ser
#o/p:
#Tomatoes    9
#Potato      7
#Eggs        3
#Onions      8
#Garlic      9
#dtype: int32

##converting a dict to a Series
my_dict = {'d': 4,'n':6,'a':7}
dict_to_series = pd.Series(my_dict)
dict_to_series
#o/p:
#d    4
#n    6
#a    7
#dtype: int64

my_dict1 = {'a':7,'b':8,'c':10}
dict_to_ser = pd.Series(my_dict1,index = ['a','e','b','c','d'])
dict_to_ser
#o/p:
#a     7.0
#e     NaN
#b     8.0
#c    10.0
#d     NaN
#dtype: float64

##from a scalar value to a Series
ser1 = pd.Series(5,index=['This','is','scalar','format'])
ser1
#o/p:
#This      5
#is        5
#scalar    5
#format    5
#dtype: int64
a. From ndarray:
If you are using an ndarray as the data for a Series, the index should be of the same length as the data. If no index is defined, a default index will be created with values [0, …, len(data)-1].
b. From a dict:
As we learned earlier, a dict is a key: value representation of data. We can use a dict to define a Series object. If we don't pass index values, the keys of the dict are used as the index, appearing in the same order as defined in the dict. On the other hand, if an index is passed to the Series definition, only the data corresponding to keys present in the dict is pulled out. The remaining index labels are recorded as NaN (Not a Number), which is the standard data marker for missing data.
c. From a scalar value:
If data is a scalar value or a constant, then we definitely need to specify an index. The same constant value is replicated up to the length of the index, once for each index label.
The following gist showcases a number of functions that extract information from a Series object.
{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "#import the necessary libraries\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0 4\n", "1 7\n", "2 3\n", "3 0\n", "4 9\n", "5 1\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#define a series wih no index\n", "ser = pd.Series([4,7,3,0,9,1])\n", "ser" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The value attribute is used to extract the actual data stored in the Series. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([4, 7, 3, 0, 9, 1], dtype=int64)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The index attribute is used to represent the index objects." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=6, step=1)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser.index" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#You can play with the indices using the arithmetic operators\n", "ser[1*2]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser[2+3]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "A 4\n", "B 2\n", "C 6\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Define a Series with specific index\n", "ser1 = pd.Series([4,2,6],index = ['A','B','C'])\n", "ser1" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Index(['A', 'B', 'C'], dtype='object')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser1.index" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([4, 2, 6], dtype=int64)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser1.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can manipulate the values directly by specifying the index and the new value." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A 10\n", "B 2\n", "C 6\n", "dtype: int64\n" ] } ], "source": [ "ser1['A'] = 10\n", "print(ser1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can select multiple indices at once using a list of indices. The output will in the order of indices you mention in the list." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "C 6\n", "A 10\n", "dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser1[['C','A']]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 10\n", "C 6\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#select values that are greater than 5 from Series \n", "ser1[ser1>5]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 100\n", "B 4\n", "C 36\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#multiply the contents of the two Series\n", "ser1 * ser1" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 20\n", "B 12\n", "C 16\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#alter the content of Series by performing addition and subtraction \n", "ser1 + 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "you can check if a particular index value is present, you can use the in operator. Remember Python is case sensitive and you need to specify your index as is." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'a' in ser1" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'A' in ser1" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Mumbai 4.0\n", "Delhi 6.0\n", "Srinagar NaN\n", "Kolkata 3.0\n", "Bangalore 8.0\n", "dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "index = ['Mumbai','Delhi','Srinagar','Kolkata','Bangalore']\n", "ser2 = pd.Series([4,6,np.NaN,3,8],index)\n", "ser2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas uses 'NaN' as its standard representation for missing or unknown values. Missing values form an important aspect of data analysis. Often the data person needs to deal with the missing values to get a complete set of data points to perform further analysis. The following functions provide a way to detect all the null values in our Series object." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Mumbai False\n", "Delhi False\n", "Srinagar True\n", "Kolkata False\n", "Bangalore False\n", "dtype: bool" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.isnull(ser2)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Mumbai True\n", "Delhi True\n", "Srinagar False\n", "Kolkata True\n", "Bangalore True\n", "dtype: bool" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.notnull(ser2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we move ahead to manipulate two distinct series. 
When we use an arithmetic operator to perform some operation between 2 series, the common indices amongst the two series are considered for the operation. The rest are mentioned as NaN. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "index = ['Delhi','Mumbai','Goa','Bangalore']\n", "ser3 = pd.Series([5,7,3,1],index)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Bangalore 9.0\n", "Delhi 11.0\n", "Goa NaN\n", "Kolkata NaN\n", "Mumbai 11.0\n", "Srinagar NaN\n", "dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser2 + ser3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The name attribute allows us to call a particular Series with a name. This attribute can also be used for the indices of a Series. " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "ser2.name = 'Temperature'" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "ser2.index.name = 'City'" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "City\n", "Mumbai 4.0\n", "Delhi 6.0\n", "Srinagar NaN\n", "Kolkata 3.0\n", "Bangalore 8.0\n", "Name: Temperature, dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser2" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }
2.2.2. Dataframe:
A DataFrame is a two-dimensional indexed data structure with columns of potentially different data types. It can be thought of as a dict of Series all sharing the same index. The dataframe is the handiest and hence most widely used data structure; you can think of it as a spreadsheet or a SQL table. Like a Series, we can construct a dataframe in multiple ways:
data = pd.DataFrame(data, index, columns)

where index and columns of your choice can be specified optionally. Otherwise, the default index [0 … (len-1)] is shared by all columns. Picture a small table of books: the index runs from 0 to 4, the columns are Book Title, Author and Finished Reading?, and everything filled within the table is the data. A sketch of that table in code follows.
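Since the table is easier to picture in code, here is a minimal sketch of it; the titles and authors are placeholders, not real data:

import pandas as pd

books = pd.DataFrame({
    'Book Title': ['Book A', 'Book B', 'Book C', 'Book D', 'Book E'],   #placeholder titles
    'Author': ['Author 1', 'Author 2', 'Author 3', 'Author 4', 'Author 5'],
    'Finished Reading?': ['Yes', 'No', 'Yes', 'Yes', 'No'],
})
print(books)   #the default integer index 0 to 4 is generated automatically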
You can create a dataframe from possibly every other data structure (proceed here to study dataframe construction). To keep our study relevant to data science, I assure you that you will mostly use dataframes as containers for large .csv files, .data files or any other file format that stores data. Therefore, we start with the most basic file format that you might work on, i.e. comma-separated values, and how to load such files:
data = pd.read_csv("path_to_file\file_name.csv")

I downloaded a dataset from Kaggle and read it using the above syntax, giving a dataframe of rows and columns. Let's start with some basic handy commands that help analyze a dataframe.
import pandas as pd
import numpy as np

#define a dict of Series to obtain a DataFrame
index = ['Monday','Tuesday','Wednesday','Thursday','Friday']
dict_data = {'Banana': pd.Series([1,0,2,1,0],index),
             'Apple': pd.Series([2,1,0,2,1],index),
             'Eggs': pd.Series([2,1,1,2,1],index)}

#converting the data dictionary into a dataframe
df = pd.DataFrame(dict_data)

#the number of rows and columns in the dataframe
df.shape

#printing all column names
df.columns
#o/p: Index(['Banana', 'Apple', 'Eggs'], dtype='object')

#printing all indices
df.index
#o/p: Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], dtype='object')

#print the first 3 rows
#change this 3 to any n to list the first n rows
df.head(3)

#print the last 3 rows
#same usage as head()
df.tail(3)

#print the count, mean, std and five-number summary of the dataset
df.describe()

#two ways to access a specific column
df.Eggs
df['Eggs']

#access the first row of this subset
df['Eggs'][0]

#access the first up to the third element
df['Eggs'][0:3]
Indexing and Selecting Data:
The primary objective in this section is to learn how to slice, dice and generally get and set subsets of pandas objects. Just as with NumPy arrays, we can select and index data; the only difference is that in a Series or DataFrame we can also use a list of labels or even a slice of labels.
dataframe[0]                    #select the zeroth index data
dataframe[0:10]                 #select data from the zeroth to ninth index
dataframe[[0,5,6,4,7]]          #select specific indices in a list
dataframe[dataframe.SOP > 3.0]  #select all rows where SOP > 3.0


#import libraries
import pandas as pd
import numpy as np

#read the data into a dataframe
#insert your file's path here
data = pd.read_csv("F:\Projects\datasets\graduate-admissions\Admission_Predict.csv")

#using iloc to access data subsets row-wise
data.iloc[5,]                 #print the fifth row of the dataframe
data.iloc[1:6]                #print the rows from the first index up to the fifth
data.iloc[[5,10,15,20]]       #print every fifth row
data.iloc[[True,False,True]]  #using boolean values, print the zeroth and second rows

#using iloc to access data subsets column-wise
data.iloc[:,5]                #print all rows from the fifth column
data.iloc[:,2:5]              #print all rows from the second to fourth columns
data.iloc[:,[2,4,5]]          #print the second, fourth and fifth columns
data.iloc[:,[True,False,False,True]]  #boolean mask: print the zeroth and third columns
data.iloc[3:9,1:5]            #print the 3rd to 8th rows and 1st to 4th columns
data.iloc[[0,2,4,6,8,10],[1,3,5,7,]]  #print even rows and odd columns
We use .iloc to select data by the integer-based positions of rows and columns, starting from 0 up to length-1 of either axis. It may also be used with a boolean array. iloc will raise an IndexError if the mentioned index, or a slice of indices, is out of bounds. It is important to use the colon operator for row selection if we only plan to select columns.
#use the same datasource as in iloc example
data.loc[0]                    #print the row labelled 0
data[0:6]                      #plain slicing: print rows 0 to 5
data.loc[[2,5,6,4]]            #print the rows labelled 2, 5, 6 and 4
data.loc[[True,False,True]]    #boolean mask: print the zeroth and second rows
data.loc[:,'SOP']              #print all rows from the SOP column
data.loc[:,'Serial No.':'Research']      #print all columns from 'Serial No.' to 'Research'
data.loc[:,['Serial No.','SOP','CGPA']]  #print Serial No., SOP and CGPA only
data.loc[:,[True,False,True]]  #boolean mask: print the zeroth and second columns

#print the given element from the SOP column
data.loc[7,'SOP']
#o/p: 3.0

#print the rows labelled 3 up to 8 (inclusive),
#columns from 'Serial No.' to 'SOP'
data.loc[3:8,'Serial No.':'SOP']

#print specific rows and a column list
data.loc[[1,3,5,7],['Serial No.','SOP','CGPA']]

#check all rows where chance of admit is greater than 75%
data.loc[:,'Chance of Admit '] > 0.75
On the other hand, loc is used for label-based extraction of data. We have to specify the actual labels, i.e. index names and column names, as parameters.
We can select a slice of such labels or even use a boolean array. loc will raise a KeyError when the items are not found. The row index may happen to be an integer from 0 to length-1, but note that here the integer is not a position; it is a "label". For columns, we are expected to use the respective column names, say 'Column 1', 'Column 2', etc. The sketch below contrasts the two.
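A tiny self-contained sketch makes the label-versus-position distinction concrete (the column name SOP is borrowed from the admissions dataset; the values and index here are made up):

import pandas as pd

df = pd.DataFrame({'SOP': [3.5, 2.0, 4.5]}, index=[10, 20, 30])
print(df.iloc[0])    #position 0 -> the row labelled 10
print(df.loc[10])    #label 10  -> the same row
#df.loc[0]   would raise a KeyError: no row is *labelled* 0
#df.iloc[10] would raise an IndexError: position 10 is out of bounds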
Dropping Data from Axes:
#dropping data from a Series
data = pd.Series([4,5,3,6,7],index=['dog','cat','rat','sparrow','pigeon'])

#drop the rat and cat entries
new_data = data.drop(['rat','cat'])

data
#o/p:
#dog        4
#cat        5
#rat        3
#sparrow    6
#pigeon     7
#dtype: int64

new_data
#o/p:
#dog        4
#sparrow    6
#pigeon     7
#dtype: int64

#dropping data from axes in a dataframe
data = pd.DataFrame(np.random.randint(0,10,(4,3)),
                    index=['Aquaman','Batman','Superman','Wonder Woman'],
                    columns=['Power','Reflex','Skills'])

#delete the Batman row
data.drop('Batman')

#delete the Power column
data.drop('Power',axis=1)
data.drop('Skills',axis='columns')

#delete data from the current dataframe and reflect the changes in it
data.drop('Aquaman',inplace=True)
data.drop('Power',axis=1,inplace=True)
We can drop a particular data entry from the dataframe using the drop method. drop deletes the specified data point without modifying the original object; it returns a new object with the deletions reflected.
While using the drop method on a dataframe, we can drop rows by simply mentioning their labels in quotes. If instead we want to wipe out a particular column, we have to specify an additional parameter, axis=1 or axis='columns' (axis=0 implies rows).
Whenever we use drop, the result appears in a new object and the original data is protected. If you want to modify the original data, use the parameter inplace=True. Be careful while using inplace, as it drops data from the original object permanently.
GroupBy method:
The GroupBy function in pandas resembles GROUP BY in SQL. It groups the entire dataframe based on one particular column. After we have grouped the dataframe in this manner, we can perform various actions like taking the minimum, maximum or mean of the collected data. Any groupby operation involves one of the following operations on the original data object:
- Splitting the data based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a dataframe with dummy data that has repeated categorical data to ensure grouping." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame({'Random food' : ['Chicken', 'Eggs', 'Sandwich','Potato', 'Eggs','Cheese cubes','Pizza','Oats',\n", " 'Potato', 'Tomato', 'Eggs', 'Chicken','Cheese cubes','Sandwich',],\n", " 'Day of Week' : ['Monday', 'Tuesday', 'Monday', 'Wednesday','Sunday','Saturday','Sunday',\n", " 'Friday', 'Wednesday', 'Monday', 'Tuesday', 'Monday','Thursday', 'Thursday'],\n", " 'How many times/day?': np.random.randint(low=1,high=4,size=14),\n", " 'Quantity' : np.random.randint(low=1,high=5,size=14),\n", " 'Price in rupees' : np.random.randint(low=15,high=200,size=14),\n", " 'Healthy?': ['Yes','No','Yes','Yes','Yes','No','No','Yes','Yes','No','No','Yes','No','Yes']})" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Random food</th>\n", " <th>Day of Week</th>\n", " <th>How many times/day?</th>\n", " <th>Quantity</th>\n", " <th>Price in rupees</th>\n", " <th>Healthy?</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Chicken</td>\n", " <td>Monday</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>124</td>\n", " <td>Yes</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Eggs</td>\n", " <td>Tuesday</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>41</td>\n", " <td>No</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Sandwich</td>\n", " <td>Monday</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>192</td>\n", " <td>Yes</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Potato</td>\n", " <td>Wednesday</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>107</td>\n", " <td>Yes</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>Eggs</td>\n", " <td>Sunday</td>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>113</td>\n", " <td>Yes</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>Cheese cubes</td>\n", " <td>Saturday</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>169</td>\n", " <td>No</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>Pizza</td>\n", " <td>Sunday</td>\n", " <td>2</td>\n", " <td>4</td>\n", " <td>53</td>\n", " <td>No</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>Oats</td>\n", " <td>Friday</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>86</td>\n", " <td>Yes</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>Potato</td>\n", " <td>Wednesday</td>\n", " <td>3</td>\n", " <td>1</td>\n", " <td>37</td>\n", " <td>Yes</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>Tomato</td>\n", " <td>Monday</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>43</td>\n", " <td>No</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>Eggs</td>\n", " <td>Tuesday</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>49</td>\n", " <td>No</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>Chicken</td>\n", " <td>Monday</td>\n", " 
<td>2</td>\n", " <td>2</td>\n", " <td>51</td>\n", " <td>Yes</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>Cheese cubes</td>\n", " <td>Thursday</td>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>179</td>\n", " <td>No</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>Sandwich</td>\n", " <td>Thursday</td>\n", " <td>1</td>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>Yes</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Random food Day of Week How many times/day? Quantity Price in rupees \\\n", "0 Chicken Monday 1 2 124 \n", "1 Eggs Tuesday 2 1 41 \n", "2 Sandwich Monday 1 1 192 \n", "3 Potato Wednesday 1 3 107 \n", "4 Eggs Sunday 2 3 113 \n", "5 Cheese cubes Saturday 1 3 169 \n", "6 Pizza Sunday 2 4 53 \n", "7 Oats Friday 1 2 86 \n", "8 Potato Wednesday 3 1 37 \n", "9 Tomato Monday 1 1 43 \n", "10 Eggs Tuesday 1 1 49 \n", "11 Chicken Monday 2 2 51 \n", "12 Cheese cubes Thursday 2 3 179 \n", "13 Sandwich Thursday 1 4 15 \n", "\n", " Healthy? \n", "0 Yes \n", "1 No \n", "2 Yes \n", "3 Yes \n", "4 Yes \n", "5 No \n", "6 No \n", "7 Yes \n", "8 Yes \n", "9 No \n", "10 No \n", "11 Yes \n", "12 No \n", "13 Yes " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas objects can be split on either of the axes. The abstract definition of grouping id to provide a mapping of labels to group names. To create a GroupBy object, you may do the following:\n", "grouped = data.groupby(key)\n", "grouped = data.groupby([key1, key2])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [], "source": [ "grouped=df.groupby('Healthy?')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "name attribute displays all data grouped under one group. same representation is repeated for all possible groups." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No\n", " Random food Day of Week How many times/day? Quantity Price in rupees \\\n", "1 Eggs Tuesday 2 1 41 \n", "5 Cheese cubes Saturday 1 3 169 \n", "6 Pizza Sunday 2 4 53 \n", "9 Tomato Monday 1 1 43 \n", "10 Eggs Tuesday 1 1 49 \n", "12 Cheese cubes Thursday 2 3 179 \n", "\n", " Healthy? \n", "1 No \n", "5 No \n", "6 No \n", "9 No \n", "10 No \n", "12 No \n", "Yes\n", " Random food Day of Week How many times/day? Quantity Price in rupees \\\n", "0 Chicken Monday 1 2 124 \n", "2 Sandwich Monday 1 1 192 \n", "3 Potato Wednesday 1 3 107 \n", "4 Eggs Sunday 2 3 113 \n", "7 Oats Friday 1 2 86 \n", "8 Potato Wednesday 3 1 37 \n", "11 Chicken Monday 2 2 51 \n", "13 Sandwich Thursday 1 4 15 \n", "\n", " Healthy? \n", "0 Yes \n", "2 Yes \n", "3 Yes \n", "4 Yes \n", "7 Yes \n", "8 Yes \n", "11 Yes \n", "13 Yes \n" ] } ], "source": [ "for name,group in grouped:\n", " print(name)\n", " print(group)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "get_group('groupname') returns all the data under that particular group" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Random food Day of Week How many times/day? 
Quantity Price in rupees \\\n", "0 Chicken Monday 1 2 124 \n", "2 Sandwich Monday 1 1 192 \n", "3 Potato Wednesday 1 3 107 \n", "4 Eggs Sunday 2 3 113 \n", "7 Oats Friday 1 2 86 \n", "8 Potato Wednesday 3 1 37 \n", "11 Chicken Monday 2 2 51 \n", "13 Sandwich Thursday 1 4 15 \n", "\n", " Healthy? \n", "0 Yes \n", "2 Yes \n", "3 Yes \n", "4 Yes \n", "7 Yes \n", "8 Yes \n", "11 Yes \n", "13 Yes \n" ] } ], "source": [ "print(grouped.get_group('Yes'))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "grouped1 = df.groupby('Day of Week')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An aggregated function returns a single aggregated value for each group. Once the group by object is created, several aggregation operation can be performed on the collection of data.\n", "In the example below, agg method is used compute mean of Price in Rupees on the groups created by grouped1." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Day of Week\n", "Friday 86.0\n", "Monday 102.5\n", "Saturday 169.0\n", "Sunday 83.0\n", "Thursday 97.0\n", "Tuesday 45.0\n", "Wednesday 72.0\n", "Name: Price in rupees, dtype: float64\n" ] } ], "source": [ "print(grouped1['Price in rupees'].agg(np.mean))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also perform a number of opeartions in one go." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " mean amax sum\n", "Day of Week \n", "Friday 86.0 86 86\n", "Monday 102.5 192 410\n", "Saturday 169.0 169 169\n", "Sunday 83.0 113 166\n", "Thursday 97.0 179 194\n", "Tuesday 45.0 49 90\n", "Wednesday 72.0 107 144\n" ] } ], "source": [ "print(grouped1['Price in rupees'].agg([np.mean,np.max,np.sum]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "size function represents the size of each group." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Day of Week\n", "Friday 1\n", "Monday 4\n", "Saturday 1\n", "Sunday 2\n", "Thursday 2\n", "Tuesday 2\n", "Wednesday 2\n", "Name: Quantity, dtype: int32\n" ] } ], "source": [ "print(grouped1['Quantity'].agg(np.size))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transformation on a group or a column returns an object that is indexed the same as the one being grouped.Look at the example for better understanding. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "score = lambda x: (x / x.sum())*100" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " How many times/day? 
Quantity Price in rupees\n", "0 20.000000 33.333333 30.243902\n", "1 66.666667 50.000000 45.555556\n", "2 20.000000 16.666667 46.829268\n", "3 25.000000 75.000000 74.305556\n", "4 50.000000 42.857143 68.072289\n", "5 100.000000 100.000000 100.000000\n", "6 50.000000 57.142857 31.927711\n", "7 100.000000 100.000000 100.000000\n", "8 75.000000 25.000000 25.694444\n", "9 20.000000 16.666667 10.487805\n", "10 33.333333 50.000000 54.444444\n", "11 40.000000 33.333333 12.439024\n", "12 66.666667 42.857143 92.268041\n", "13 33.333333 57.142857 7.731959\n" ] } ], "source": [ "print(grouped1.transform(score))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filter function as the name saya filters out all the data that doesn't fit in the given criteria. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Random food Day of Week How many times/day? Quantity Price in rupees \\\n", "6 Pizza Sunday 2 4 53 \n", "7 Oats Friday 1 2 86 \n", "9 Tomato Monday 1 1 43 \n", "\n", " Healthy? \n", "6 No \n", "7 Yes \n", "9 No \n" ] } ], "source": [ "print(df.groupby('Random food').filter(lambda x: len(x) <= 1))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }
Reshaping using PivotTable:
Create a spreadsheet-style pivot table as a DataFrame. Data is often stored in CSV files or databases in so-called "stacked" or "record" format, and we can use the pivot_table function to quickly summarize it. To understand the application of pivot tables, I decided to work on an actual dataset, so go ahead and download the Black Friday dataset from Kaggle here. This dataset records more than 50,000 consumer purchases on a Black Friday sale. Say you want an answer to: what is the total purchase done by men and women of each age group? Or what is the mean purchase done by men and women of different city categories?
Simple: sketch a pivot table with the code below.
df = pd.read_csv('YOUR_FILEPATH_HERE/BlackFriday.csv')

#create a pivot table
table = pd.pivot_table(df,values = 'Purchase',
                       index=['Gender'],columns='Age',aggfunc=np.sum)

#create a pivot table with a nested index
table1 = pd.pivot_table(df,values = 'Purchase',
                        index=['Gender','City_Category'],columns='Age',aggfunc=np.mean)


The syntax of pivot_table has four important parameters. values is the column or list of columns to be aggregated. index and columns can each be a single column or a list of columns. aggfunc is a function or list of functions applied to aggregate the data; we can use mean, min, max, sum, median and many more functions on our aggregate. A small sketch with a list of functions follows.
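For instance, here is a minimal sketch of passing a list of functions to aggfunc, using a tiny dummy dataframe in place of the Black Friday data:

import numpy as np
import pandas as pd

#dummy data with the same column names as the Black Friday dataset
df_demo = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F'],
                        'Age': ['0-17', '0-17', '26-35', '26-35'],
                        'Purchase': [100, 150, 200, 250]})

#one pivot table, two aggregations side by side
print(pd.pivot_table(df_demo, values='Purchase', index='Gender',
                     columns='Age', aggfunc=[np.mean, np.sum]))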
Working with Missing data:
Many times the data that we have to work on contains missing values, i.e. either the data was never recorded or it simply didn't exist. By "missing" we mean NA ('not available'). While NaN is the default missing-value marker, we need to be able to easily detect missing values across all kinds of types: float, int, boolean, general object. In many cases we may also encounter None, which should equally be treated as missing or not available; the short sketch below shows pandas doing exactly that.
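Here is that quick sketch: pandas converts None to NaN on the way in, and isna() flags both:

import numpy as np
import pandas as pd

s = pd.Series([1.0, None, np.nan, 3.0])
print(s)           #the None becomes NaN in a float Series
print(s.isna())    #positions 1 and 2 are both flagged True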
#Use the same Black Friday dataset

#which rows have NaN values in the Product_Category_2 column?
df['Product_Category_2'].isna()
#that is a long list of Trues and Falses; we can derive nothing from it,
#so let's summarize the output
df['Product_Category_2'].isna().value_counts()
#we get a count of each True and False category

#count all values that are not equal to NaN
df['Age'].notna().value_counts()

df['Product_Category_3'].notna().value_counts()
#around 3.7 lakh NaN values are present in this column
#o/p:
#False    373299
#True     164278
#Name: Product_Category_3, dtype: int64

#fill missing data using the fillna() method
df['Product_Category_2'].fillna(0)
#OR
df['Product_Category_2'].fillna('missing')

#drop all rows that have NaN values
df.dropna(axis=0)
#drop all columns that have NaN values
df.dropna(axis=1)
#drop the NaN values from a specific column
df['Product_Category_2'].dropna()

#using the interpolate method to fill NaN values
s = pd.Series([1,3,5,7,9,np.nan,13])
s.interpolate()
#o/p:
#0     1.0
#1     3.0
#2     5.0
#3     7.0
#4     9.0
#5    11.0
#6    13.0
#dtype: float64

s = pd.Series([0,5,10,15,20,25,np.nan,30])
s.interpolate()
#o/p:
#0     0.0
#1     5.0
#2    10.0
#3    15.0
#4    20.0
#5    25.0
#6    27.5
#7    30.0
#dtype: float64
- isna() checks the given data object for missing values. Where a value is NaN, it returns True, otherwise False.
- Likewise, notna() checks for actual values: it returns True where real data is found and False otherwise.
- We use the value_counts() method to count the number of instances in each case.
- Sometimes while working on the data it is necessary to replace the missing entries with usable values, so we use fillna() to fill in the missing values and then continue our analysis. We can replace the NaNs with the mean, mode or median of the relevant data object.
- Sometimes, when the data has too many missing values, it is advisable to discard the affected data entirely. For this, we use dropna(). The method can be used in three ways, as shown in the gist above: the parameter axis=0 drops all rows of the dataset that have NaN values, axis=1 drops all columns that have NaN values, and calling dropna() on a particular column drops its NaN entries without affecting the adjacent data points.
- The interpolate() method fills the missing values by interpolating between the surrounding data points. Refer to this document for further reading.
Concatenate Data Objects:
Often there will be a need to combine or concatenate two Series or two DataFrames, and in this section you will learn to do exactly that. I personally use this method when I have to combine training and testing samples into one single dataframe. So go ahead and practice the below commands in your own notebook.
{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concatenate Series " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "concatenation of Series and DataFrame in Pandas is a flexible way to combine objects along a particular axis. We can also add a tier of hierarchical indexing on concatenation axis, which may be useful if the labels are same on the passed axis number." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenate two Series object based on axis = 0 " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Rose\n", "1 Lily\n", "0 Shoe Flower\n", "1 Aster\n", "dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 = pd.Series(['Rose','Lily'])\n", "s2 = pd.Series(['Shoe Flower','Aster'])\n", "pd.concat([s1,s2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If ignore_index is True, then no matter what indices the two to-be concatenated objects, they will be numbered as starting from 0 to (len-1). This is useful if you are concatenating two objects who have no necessarily meaningful information in the index. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 Rose\n", "1 Lily\n", "2 Shoe Flower\n", "3 Aster\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1,s2],ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Construct a hierarchical index using keys parameter to distinguish between the indices. This index will be used to represent keys on the outermost level. If you provide keys, there won't be a need for ignore_index." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Tier 1 0 Rose\n", " 1 Lily\n", "Tier 2 0 Shoe Flower\n", " 1 Aster\n", "dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1,s2],keys=['Tier 1','Tier 2'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can give a name to each level of index using names parameter." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Tier level Row ID\n", "Tier 1 0 Rose\n", " 1 Lily\n", "Tier 2 0 Shoe Flower\n", " 1 Aster\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1,s2],keys=['Tier 1','Tier 2'],names=['Tier level','Row ID'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concatenate DataFrames" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame([['a',97],['b',98],['c',99]],columns=['Letter','ASCII value'])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Letter</th>\n", " <th>ASCII value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>a</td>\n", " <td>97</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>b</td>\n", " <td>98</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>c</td>\n", " <td>99</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Letter ASCII value\n", "0 a 97\n", "1 b 98\n", "2 c 99" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df2 = pd.DataFrame([['x',120],['y',121],['z',122]],columns=['Letter','ASCII value'])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Letter</th>\n", " <th>ASCII value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>x</td>\n", " <td>120</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>y</td>\n", " <td>121</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>z</td>\n", " <td>122</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Letter ASCII value\n", "0 x 120\n", "1 y 121\n", "2 z 122" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Create a concatenation of dataframes " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Letter</th>\n", " <th>ASCII value</th>\n", " </tr>\n", " 
</thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>a</td>\n", " <td>97</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>b</td>\n", " <td>98</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>c</td>\n", " <td>99</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>x</td>\n", " <td>120</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>y</td>\n", " <td>121</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>z</td>\n", " <td>122</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Letter ASCII value\n", "0 a 97\n", "1 b 98\n", "2 c 99\n", "0 x 120\n", "1 y 121\n", "2 z 122" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1,df2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply ignore_index so that the inex appears as continuous representation from 0 to (len-1)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Letter</th>\n", " <th>ASCII value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>a</td>\n", " <td>97</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>b</td>\n", " <td>98</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>c</td>\n", " <td>99</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>x</td>\n", " <td>120</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>y</td>\n", " <td>121</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>z</td>\n", " <td>122</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Letter ASCII value\n", "0 a 97\n", "1 b 98\n", "2 c 99\n", "3 x 120\n", "4 y 121\n", "5 z 122" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1,df2],ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th></th>\n", " <th>Letter</th>\n", " <th>ASCII value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"3\" valign=\"top\">Starting 3 letters</th>\n", " <th>0</th>\n", " <td>a</td>\n", " <td>97</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>b</td>\n", " <td>98</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>c</td>\n", " <td>99</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"3\" valign=\"top\">Ending 3 letters</th>\n", " <th>0</th>\n", " <td>x</td>\n", " <td>120</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>y</td>\n", " <td>121</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>z</td>\n", " <td>122</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Letter ASCII value\n", "Starting 3 letters 0 a 97\n", " 1 b 98\n", " 2 c 99\n", "Ending 3 letters 0 
x 120\n", " 1 y 121\n", " 2 z 122" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1,df2],keys=['Starting 3 letters','Ending 3 letters'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If one particular dataframe doesn't have the columns from another dataframe, such data is simply represented as Nan." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df3 = pd.DataFrame([['a',97,61],['b',98,62],['c',99,63]],columns=['Letter','ASCII value','HEX value'])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Letter</th>\n", " <th>ASCII value</th>\n", " <th>HEX value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>a</td>\n", " <td>97</td>\n", " <td>61</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>b</td>\n", " <td>98</td>\n", " <td>62</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>c</td>\n", " <td>99</td>\n", " <td>63</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Letter ASCII value HEX value\n", "0 a 97 61\n", "1 b 98 62\n", "2 c 99 63" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>ASCII value</th>\n", " <th>HEX value</th>\n", " <th>Letter</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>97</td>\n", " <td>NaN</td>\n", " <td>a</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>98</td>\n", " <td>NaN</td>\n", " <td>b</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>99</td>\n", " <td>NaN</td>\n", " <td>c</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>97</td>\n", " <td>61.0</td>\n", " <td>a</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>98</td>\n", " <td>62.0</td>\n", " <td>b</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>99</td>\n", " <td>63.0</td>\n", " <td>c</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " ASCII value HEX value Letter\n", "0 97 NaN a\n", "1 98 NaN b\n", "2 99 NaN c\n", "0 97 61.0 a\n", "1 98 62.0 b\n", "2 99 63.0 c" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1,df3],sort=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "the parameter join = \"inner\" let's us concat only those data points that are mutually present in all participating dataframes." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Letter</th>\n", " <th>ASCII value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>a</td>\n", " <td>97</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>b</td>\n", " <td>98</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>c</td>\n", " <td>99</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>a</td>\n", " <td>97</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>b</td>\n", " <td>98</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>c</td>\n", " <td>99</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Letter ASCII value\n", "0 a 97\n", "1 b 98\n", "2 c 99\n", "0 a 97\n", "1 b 98\n", "2 c 99" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1,df3],join = \"inner\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "df4 = pd.DataFrame([['apple','An apple a day!'],['boy','The boy plays football']],\n", " columns=['Words','Sentence'])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Words</th>\n", " <th>Sentence</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>apple</td>\n", " <td>An apple a day!</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>boy</td>\n", " <td>The boy plays football</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Words Sentence\n", "0 apple An apple a day!\n", "1 boy The boy plays football" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to combine dataframes column-wise, ie the dataframe will be appended one after the other, use aixs = 1 parameter." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Letter</th>\n", " <th>ASCII value</th>\n", " <th>Words</th>\n", " <th>Sentence</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>a</td>\n", " <td>97</td>\n", " <td>apple</td>\n", " <td>An apple a day!</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>b</td>\n", " <td>98</td>\n", " <td>boy</td>\n", " <td>The boy plays football</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>c</td>\n", " <td>99</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Letter ASCII value Words Sentence\n", "0 a 97 apple An apple a day!\n", "1 b 98 boy The boy plays football\n", "2 c 99 NaN NaN" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1,df4],axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use verify_integrity=True if you want to be sure that whatever dataframe's index is present, there should not be duplicate values." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>a</th>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0\n", "a 1" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5 = pd.DataFrame([1],index=['a'])\n", "df5" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>a</th>\n", " <td>2</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0\n", "a 2" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df6 = pd.DataFrame([2],index=['a'])\n", "df6" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "ename": "ValueError", "evalue": "Indexes have overlapping values: Index(['a'], dtype='object')", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mValueError\u001b[0m Traceback (most 
recent call last)", "\u001b[1;32m<ipython-input-22-effc43a713f2>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mpd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mconcat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mdf5\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdf6\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mverify_integrity\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\reshape\\concat.py\u001b[0m in \u001b[0;36mconcat\u001b[1;34m(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)\u001b[0m\n\u001b[0;32m 223\u001b[0m \u001b[0mkeys\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mkeys\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlevels\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mlevels\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnames\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mnames\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 224\u001b[0m \u001b[0mverify_integrity\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mverify_integrity\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 225\u001b[1;33m copy=copy, sort=sort)\n\u001b[0m\u001b[0;32m 226\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mop\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_result\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 227\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\reshape\\concat.py\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)\u001b[0m\n\u001b[0;32m 376\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcopy\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 377\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 378\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mnew_axes\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_get_new_axes\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 379\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 380\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mget_result\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\reshape\\concat.py\u001b[0m in \u001b[0;36m_get_new_axes\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 456\u001b[0m \u001b[0mnew_axes\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0max\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 457\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 458\u001b[1;33m \u001b[0mnew_axes\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_get_concat_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 459\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mnew_axes\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 460\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", 
"\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\reshape\\concat.py\u001b[0m in \u001b[0;36m_get_concat_axis\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 514\u001b[0m self.levels, self.names)\n\u001b[0;32m 515\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 516\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_maybe_check_integrity\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mconcat_axis\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 517\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 518\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mconcat_axis\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\core\\reshape\\concat.py\u001b[0m in \u001b[0;36m_maybe_check_integrity\u001b[1;34m(self, concat_index)\u001b[0m\n\u001b[0;32m 523\u001b[0m \u001b[0moverlap\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mconcat_index\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mconcat_index\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mduplicated\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0munique\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 524\u001b[0m raise ValueError('Indexes have overlapping values: '\n\u001b[1;32m--> 525\u001b[1;33m '{overlap!s}'.format(overlap=overlap))\n\u001b[0m\u001b[0;32m 526\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 527\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mValueError\u001b[0m: Indexes have overlapping values: Index(['a'], dtype='object')" ] } ], "source": [ "pd.concat([df5, df6], verify_integrity=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }
A long article, wasn’t it? But I hope it was informative and easy to understand as well. One can read thousands of articles and move on; however, real understanding comes from hands-on practice. I urge you to work through the examples in the gists above thoroughly on your own system.
As promised, I have created a GitHub repository that contains practice problems, with their respective solutions, for each topic explained above. Clone the repository and practice, practice, practice!
Find the repository here.
Source: towardsdatascience