Global variables are defined outside any function and can be read from inside a function; to rebind one inside a function, declare it there with the global keyword. A variable defined inside a function's body (the local scope) is known as a local variable.
A namespace is a naming system that ensures every name within it is unique. Think of it as a container that maps each variable name to the object it refers to. When we use a name, Python searches the relevant namespace for it and retrieves the corresponding object. Python maintains a dictionary for this purpose.
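A minimal sketch of the global/local distinction described above. The names counter, increment, and shadow are made up for illustration.

```python
counter = 0  # global variable, stored in the module namespace


def increment():
    global counter  # needed because we rebind the global name
    counter += 1


def shadow():
    counter = 100  # local variable; the global counter is untouched
    return counter


increment()
local_value = shadow()
print(counter, local_value)  # 1 100
```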
Pass: It is used when a block of code is required syntactically but nothing should be executed. It is a null operation: nothing happens when it runs.
Continue: It skips the rest of the current iteration when some specific condition is met and transfers control back to the beginning of the loop. The loop does not terminate but continues with the next iteration.
Break: It terminates the loop when some condition is met, and the control of the program flows to the statement immediately after the body of the loop. If the break statement is inside a nested loop (a loop inside another loop), it terminates only the innermost loop.
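The three statements can be seen together in one small loop. This is only an illustrative sketch; the conditions are arbitrary.

```python
visited = []
for n in range(10):
    if n == 2:
        pass  # placeholder: does nothing, execution simply falls through
    if n % 2 == 1:
        continue  # skip odd numbers and move on to the next iteration
    if n == 8:
        break  # leave the loop entirely
    visited.append(n)

print(visited)  # [0, 2, 4, 6]
```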
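range(): a built-in function of base Python. In Python 3 it returns a lazy, immutable range object of integers; in Python 2 it returned a list.
xrange(): exists only in Python 2, where it returned a lazy xrange object. Python 3 removed it because range() took over that behaviour.
arange(): a function in the NumPy library that returns an ndarray. It can produce fractional values as well.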
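A short sketch comparing the two under Python 3, assuming NumPy is installed. The start, stop, and step values are arbitrary.

```python
import numpy as np

r = range(0, 10, 3)  # lazy range object, integers only
ints = list(r)  # materialize it as a list
a = np.arange(0, 1, 0.25)  # ndarray; fractional steps are allowed

print(ints)  # [0, 3, 6, 9]
print(a.tolist())  # [0.0, 0.25, 0.5, 0.75]
```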
del: a statement rather than a function. It deletes the element at a given position (or a slice) and does not return the deleted value. The elements after the deleted position shift left, so their indexes decrease by one. It can also be used to delete the entire data structure.
clear(): removes every element, leaving an empty list.
remove(): deletes the first occurrence of a given value, so it can be used when you know which particular value to delete. It raises ValueError if the value is not present.
pop(): by default removes the last element (or the element at a given index) and returns the removed value, so we can store it in a variable and use it later.
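The four operations applied to one example list, as a quick sketch:

```python
items = [10, 20, 30, 20, 40]

del items[0]  # by position, returns nothing -> [20, 30, 20, 40]
items.remove(20)  # by value, first match only -> [30, 20, 40]
last = items.pop()  # returns 40 -> [30, 20]
first = items.pop(0)  # returns 30 -> [20]
snapshot = list(items)  # keep a copy before clearing
items.clear()  # -> []

print(last, first, snapshot, items)  # 40 30 [20] []
```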
The map function applies the given function to every element of an iterable. In Python 3 it returns a lazy map iterator rather than a list; wrap it in list() to get a new list.
The reduce function (functools.reduce in Python 3) applies a two-argument operation cumulatively to the items of a sequence. It uses the result of each operation as the first parameter of the next operation. It returns a single item and not a list.
The filter function keeps the items of an iterable (list, set, tuple) for which a test function passed as an argument returns true. In Python 3 its output is a lazy filter iterator; wrap it in list() to get a filtered list.
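A compact sketch of all three on the same input. The lambda functions are arbitrary examples.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

squares = list(map(lambda x: x * x, numbers))  # apply to every element
evens = list(filter(lambda x: x % 2 == 0, numbers))  # keep matching elements
total = reduce(lambda acc, x: acc + x, numbers)  # fold into a single value

print(squares)  # [1, 4, 9, 16, 25]
print(evens)  # [2, 4]
print(total)  # 15
```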
Indexing looks up a single value in a data structure by its position or key, whereas slicing retrieves a sequence of elements.
‘==’ checks whether two objects have equal values, and ‘is’ checks identity, that is, whether two names refer to the same object.
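Both distinctions can be checked directly in a few lines. The variable names here are illustrative only.

```python
letters = ["a", "b", "c", "d", "e"]
one = letters[1]  # indexing: a single element
part = letters[1:4]  # slicing: a sub-list

x = [1, 2, 3]
y = [1, 2, 3]  # equal contents, but a separate object
z = x  # second name for the same object

print(one, part)  # b ['b', 'c', 'd']
print(x == y, x is y, x is z)  # True False True
```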
A generator is a function that uses yield to return an iterator, which produces its values one at a time, on demand. A decorator is a callable that wraps a function, method, or class to modify or extend its behaviour.
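A minimal sketch of each. The names countdown, shout, and greet are invented for this example.

```python
import functools


def countdown(n):
    """Generator: yields n, n-1, ..., 1, one value per request."""
    while n > 0:
        yield n
        n -= 1


def shout(func):
    """Decorator: upper-cases whatever string the wrapped function returns."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper


@shout
def greet(name):
    return f"hello, {name}"


values = list(countdown(3))
greeting = greet("ada")
print(values)  # [3, 2, 1]
print(greeting)  # HELLO, ADA
```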
A tuple can be unpacked, meaning its elements can be assigned to separate variables in the following manner:
Example: We have the tuple x = (500, 352)
This tuple x can be assigned to two new variables in this way: a, b = x
Now print(a) displays 500 and print(b) displays 352.
Tuple unpacking gives each value its own name. In machine learning code, functions often return their output as a tuple. Let's say x = (avg, max) and we want to use these values separately for further analysis; then we can use the unpacking feature of tuples.
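The same idea as runnable code. The summarize helper is hypothetical and stands in for any function that returns several results as a tuple.

```python
def summarize(values):
    """Hypothetical helper returning (average, maximum) as a tuple."""
    return sum(values) / len(values), max(values)


x = (500, 352)
a, b = x  # unpack a literal tuple
avg, biggest = summarize([2, 4, 9])  # unpack a returned tuple

print(a, b)  # 500 352
print(avg, biggest)  # 5.0 9
```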
Data types that are constructed from simple, primitive data types are compound data types. Data structures in Python allow us to store multiple observations. These are lists, tuples, sets, and dictionaries.
The mutability of a data structure is the ability to change a portion of it without having to recreate it. Lists, sets, and dictionaries are mutable.
Immutability means that an object cannot be changed after its creation. Integers, strings, floats, bools, and tuples are immutable. Dictionary keys must be hashable, which in practice means they are usually immutable objects.
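A small sketch that makes the difference observable: a list changes in place and keeps its identity, while a tuple rejects item assignment and string methods return new objects.

```python
nums = [1, 2, 3]
before = id(nums)
nums.append(4)  # changed in place
same_list = id(nums) == before  # still the same object

point = (1, 2)
try:
    point[0] = 99  # tuples do not support item assignment
    tuple_changed = True
except TypeError:
    tuple_changed = False

text = "data"
upper = text.upper()  # returns a new string; text is unchanged

print(same_list, tuple_changed, text, upper)  # True False data DATA
```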
A module is a single .py file containing functions, definitions, and variables designed to do certain tasks. It can be imported at any time during a session and needs to be imported only once, because Python caches the loaded module. There are two ways to import a Python module: import module_name or from module_name import name.
A library is a collection of reusable code that lets us perform a variety of tasks without having to write that code ourselves. The term Python library has no precise technical meaning; it loosely refers to a collection of modules. We use a library by importing it and then calling its methods (or attributes) with a period (.).
Some of the statistical functions in pandas are:
sum() - returns the sum of the values.
mean() - returns the mean, that is, the average of the values.
std() - returns the standard deviation of the numerical columns (the sample standard deviation by default).
min() - returns the minimum value.
max() - returns the maximum value.
abs() - returns the absolute value of each element.
prod() - returns the product of the values.
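These functions can be tried on a small Series. This is a sketch assuming pandas is installed; the data is arbitrary.

```python
import pandas as pd

s = pd.Series([-2, 1, 4])

total = s.sum()  # 3
mean = s.mean()  # 1.0
std = s.std()  # 3.0 (sample standard deviation, ddof=1)
lowest, highest = s.min(), s.max()  # -2, 4
absolute = s.abs().tolist()  # [2, 1, 4]
product = s.prod()  # -8
```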
The function to_numpy() is used to convert the DataFrame to a NumPy array.
DataFrame.to_numpy(dtype=None, copy=False)
The dtype parameter defines the data type of the resulting array, and copy=True ensures that the returned value is not a view on another array.
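A brief sketch of both parameters, assuming pandas and NumPy are installed. The column names are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

arr = df.to_numpy()  # mixed int/float columns become float64
arr32 = df.to_numpy(dtype="float32", copy=True)  # explicit dtype, forced copy

print(arr.shape, arr.dtype)  # (2, 2) float64
print(arr32.dtype)  # float32
```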
Vectorization is the process of running an operation on an entire array at once instead of element by element. This reduces the amount of Python-level iteration performed. Pandas has a number of vectorized functions, such as aggregations and string functions, that are optimized to operate on Series and DataFrames. It is therefore preferable to use the vectorized pandas functions so that operations execute quickly.
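As a rough sketch, the vectorized column arithmetic below gives the same result as an explicit Python loop, but it is expressed as one operation on whole columns. The column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Vectorized: one expression over whole columns
df["total"] = df["price"] * df["qty"]

# Equivalent explicit loop, shown only for comparison
looped = [p * q for p, q in zip(df["price"], df["qty"])]

# Vectorized string function
names = pd.Series(["ada", "grace"]).str.upper()

print(df["total"].tolist())  # [10.0, 40.0, 90.0]
print(names.tolist())  # ['ADA', 'GRACE']
```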
DataFrame.iterrows() is used to iterate over the rows of a pandas DataFrame as (index, Series) pairs: each iteration returns a tuple with the row's index label and the row's content in the form of a Series. To iterate over the columns as (column name, Series) pairs, use DataFrame.items() instead.
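A small sketch of row-wise iteration with iterrows(), next to column-wise iteration with items(). The frame contents are invented.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]}, index=["r1", "r2"])

# Row-wise: each step yields (index label, row as a Series)
rows = [(idx, int(row["score"])) for idx, row in df.iterrows()]

# Column-wise: each step yields (column name, column as a Series)
cols = [col for col, series in df.items()]

print(rows)  # [('r1', 1), ('r2', 2)]
print(cols)  # ['name', 'score']
```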
Matplotlib is one of the most popular data visualization libraries for plotting data. This comprehensive library is used for creating static, animated, and interactive visualizations. Developed by John D. Hunter, this open-source library was first released in 2003. Matplotlib also provides various toolkits that extend its functionality, such as Basemap, Cartopy, Excel tools, GTK tools, and more.
Reindexing means conforming a DataFrame to a new index with optional filling logic, placing NA/NaN in locations that had no value in the previous index. It can change both the row labels and the column labels of a DataFrame.
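A minimal sketch of reindex() on a Series, with and without a fill value. The labels are arbitrary.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

r = s.reindex(["c", "a", "d"])  # "d" is new, so it gets NaN
filled = s.reindex(["c", "a", "d"], fill_value=0)  # fill new labels with 0

print(r.isna().tolist())  # [False, False, True]
print(filled.tolist())  # [3, 1, 0]
```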
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It consists of three principal components: the data, the rows, and the columns. A pandas DataFrame can be created from lists, a dictionary, a list of dictionaries, and so on.
To create an empty DataFrame in pandas, type:
import pandas as pd
df = pd.DataFrame()
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:
>>> s = pd.Series(data, index=index)
where data can be a Python dict, an ndarray, or a scalar value.
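The three kinds of input can be sketched as follows, assuming pandas and NumPy are installed. The labels and values are arbitrary.

```python
import numpy as np
import pandas as pd

from_dict = pd.Series({"a": 1, "b": 2})  # dict keys become the index
from_array = pd.Series(np.array([1.5, 2.5]), index=["x", "y"])
from_scalar = pd.Series(7, index=["p", "q", "r"])  # scalar repeated per label

print(from_dict.index.tolist())  # ['a', 'b']
print(from_array["y"])  # 2.5
print(from_scalar.tolist())  # [7, 7, 7]
```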
To create a copy in pandas, we can call the copy() function on a Series:
s2 = s1.copy() creates a copy of Series s1 in a new Series s2.
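A quick sketch showing that the default (deep) copy is independent of the original:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = s1.copy()  # deep=True is the default
s2.iloc[0] = 99  # modify only the copy

print(s1.tolist())  # [1, 2, 3]
print(s2.tolist())  # [99, 2, 3]
```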
In pandas, the groupby() function lets programmers rearrange real-world data sets by grouping them. The primary task of the function is to split the data into various groups.
To write a single object to an Excel file, we can simply specify the target file's name. However, to write multiple sheets, we need to create an ExcelWriter object with the target file name and specify the sheet we wish to export to.
Numerical Python (NumPy) is a third-party Python package (it is not part of the standard library) for performing numerical computations and processing multidimensional and single-dimensional array elements.
Calculations on NumPy arrays are generally faster than the same calculations on Python lists.
To iterate over a DataFrame in pandas, a for loop can be used in combination with an iterrows() call.
The .rename method can be used to rename the columns or index values of a DataFrame.
To add rows to a DataFrame, we can assign through the .loc indexer, which is label based. The .iloc indexer is integer-position based; it can modify existing rows but cannot enlarge the DataFrame. The older .ix indexer, which was both label and integer based, was deprecated and then removed in pandas 1.0. To add columns to the DataFrame, we can again use .loc.
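A minimal sketch of enlarging a DataFrame through .loc, assuming a recent pandas version. The labels are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

df.loc["z"] = [5, 6]  # new row label -> row is appended
df.loc[:, "c"] = df["a"] + df["b"]  # new column label -> column is added

print(df.index.tolist())  # ['x', 'y', 'z']
print(df["c"].tolist())  # [4, 6, 11]
```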
To create a copy of a Series in pandas, the following syntax is used:
Series.copy(deep=True)
* If the value of deep is set to False, neither the data nor the indices are copied; the new object refers to those of the original.
To reindex means to modify the data to match a particular set of labels along a particular axis.
Various operations can be achieved using reindexing, such as:
* Insert missing value (NA) markers in label locations where no data for the label existed.
* Reorder the existing set of data to match a new set of labels.
In pandas, the groupby() function lets us rearrange real-world data sets by grouping them. Its primary task is to split the data into various groups. These groups are categorized based on some criteria. The objects can be split along any of their axes.
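A small sketch of splitting by a key column and then summing each group. The column names team and points are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "b", "a", "b"],
    "points": [1, 2, 3, 4],
})

# Split rows by team, then sum the points within each group
totals = df.groupby("team")["points"].sum()

print(totals.to_dict())  # {'a': 4, 'b': 6}
```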
In pandas, there are several useful data operations for a DataFrame, which are as follows:
* Row and column selection
We can select any row or column of the DataFrame by passing its name. A single selected row or column is one-dimensional and is returned as a Series.
* Filter Data
We can filter the data by applying boolean expressions to the DataFrame.
* Null values
A null value occurs when no data is provided for an item. Columns may contain missing values, which are usually represented as NaN.
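The three operations listed above can be sketched on one small frame. The column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "age": [30, None, 25]})

ages = df["age"]  # column selection returns a Series
older = df[df["age"] > 26]  # boolean filter keeps matching rows
missing = int(df["age"].isna().sum())  # count of NaN entries

print(type(ages).__name__)  # Series
print(older["name"].tolist())  # ['a']
print(missing)  # 1
```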
Reindexing is used to change the index of the rows and columns of the DataFrame. We can reindex single or multiple rows by using the reindex() method. Labels in the new index that are not present in the DataFrame are assigned NaN by default.
Multiple indexing (a MultiIndex, also called hierarchical indexing) is essential for data analysis and manipulation, especially when working with higher-dimensional data. It also enables us to store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures like Series and DataFrame.
The main task of data aggregation is to apply some aggregation to one or more columns. It uses the following:
* sum: It is used to return the sum of the values for the requested axis.
* min: It is used to return the minimum of the values for the requested axis.
* max: It is used to return the maximum of the values for the requested axis.
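A short sketch of these aggregations, per column with agg() and per row with axis=1. The data is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

col_summary = df.agg(["sum", "min", "max"])  # one result row per aggregation
row_sums = df.sum(axis=1)  # aggregate across columns instead

print(col_summary.loc["sum", "a"])  # 6
print(row_sums.tolist())  # [5, 7, 9]
```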
An offset specifies a rule that a set of dates conforms to, such as a calendar month or a business day, and is represented by a DateOffset. We can create DateOffsets to move dates forward to the next valid date under that rule.
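A brief sketch of moving dates forward with offsets, assuming pandas is installed. The dates are chosen only to show the rollover behaviour.

```python
import pandas as pd

# One calendar month after 31 January 2024 lands on the last valid day of February
next_month = pd.Timestamp("2024-01-31") + pd.DateOffset(months=1)

# One business day after Friday 1 March 2024 skips the weekend
next_bday = pd.Timestamp("2024-03-01") + pd.offsets.BDay(1)

print(next_month.date())  # 2024-02-29
print(next_bday.date())  # 2024-03-04
```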
Time series data is an important source of information that many businesses base their strategy on. From the conventional finance industry to the education industry, it records in detail how values change over time.
Time series forecasting is the use of machine learning models on time series data to predict future values.
We can export the DataFrame to an Excel file by using the to_excel() function.
To write a single object to the Excel file, we have to specify the target file name. If we want to write to multiple sheets, we need to create an ExcelWriter object with the target file name and also specify the sheet in the file to which we want to write.
To perform some high-level mathematical functions, we can convert a pandas DataFrame to a NumPy array by using the DataFrame.to_numpy() function.
The DataFrame.to_numpy() function is applied to the DataFrame and returns a NumPy ndarray.
Numerical Python (NumPy) is a Python package used for performing various numerical computations and processing multidimensional and single-dimensional array elements. Calculations using NumPy arrays are generally faster than the same calculations on normal Python lists.
We can create a copy of a Series by using the following syntax:
Series.copy(deep=True)
The above statement makes a deep copy that includes a copy of the data and the indices. If we set the value of deep to False, neither the indices nor the data are copied.
SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible.
pip is a package management system used to install and manage software packages written in Python. Many packages can be found in the Python Package Index (PyPI). Python 2.7.9 and later (on the python2 series), and Python 3.4 and later include pip (pip3 for Python 3) by default.
Plotly, also known by its URL, Plot.ly, is a company headquartered in Montreal, Quebec, that develops online analytics and data visualization tools.
SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation.
numpy is the top package name, and doing import numpy doesn't import submodule numpy.f2py. ... The link is established when you do import numpy.f2py. In your above code: import numpy as np # np is an alias pointing to numpy, but at this point numpy is not linked to numpy.f2py import numpy.
matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+.
NumPy (pronounced /ˈnʌmpaɪ/ (NUM-py) or sometimes /ˈnʌmpi/ (NUM-pee)) is an extension to the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. pandas is free software released under the three-clause BSD license.
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.