📊 Handling Large Amounts of Data In Python 📊
Introduction
Data scientist has become one of the most in-demand and exciting jobs of the 21st century. Python is also widely considered one of the best programming languages for machine learning.
So, what point am I making here?
When working with AI and machine learning, we need to handle huge amounts of data. Given enough data, a computer can find patterns in it and use them to help us. Everything from YouTube to Facebook uses machine learning to display ads and video recommendations. For this reason, today I am going to show you a few ways to handle data.
Lists.
Let's say we have 1000 names of people who have visited our website in the past week (you can find the names linked here). We need to put those 1000 names into a list so that we can find patterns in them, for example, who visited our website the most. Or maybe we want to pick a random user and gift them a premium membership for a year.
How does it work?
Typing every single name into a list by hand would waste our time and effort. There are many ways we can do it in our `main.py` file. For example, one way would be to import the `re` module and do this:

    import re

    # Split one long string of names on runs of whitespace
    names = re.split(r'\s+', '<Put all the names here>')
    print(names)

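But pasting 1000 names into a string will mess up our code real bad, and splitting on whitespace would also cut a full name like Wilbur Whitson into two entries.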
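To see the whitespace issue concretely, here is a small sketch. The string `raw` and the second name are made up for illustration; only Wilbur Whitson appears in this post. It compares splitting on any whitespace with splitting on line breaks.

```python
import re

# Hypothetical sample: two full names in one string, one name per line.
raw = "Wilbur Whitson\nAda Example"

# Splitting on any whitespace also separates first and last names.
by_whitespace = re.split(r"\s+", raw)
print(by_whitespace)

# Splitting on line breaks keeps each full name together.
by_line = raw.splitlines()
print(by_line)
```

This is one reason a separate file with one name per line is easier to work with.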
Solution:
- First, we can make a new file called `names.txt` and put all our names there instead, one name per line. Remember to put the `.txt` after `names`, as it denotes a text file, just like how we put a `.py` to denote a Python file.
- Next, we go back to our `main.py` and start writing some code.
- Opening the file:
  - We first need to open the `names.txt` file so that we can read its contents.
  - For that, we use a built-in Python function called `open()`:

        names = open("names.txt", "r")

    Woah, Woah, Woah... Relax buddy, what did I do there?
  - Arguments:
    - When you use `open()`, you need to specify which file you are opening. To do that, we write `open("<file name>")`; the file name is the first argument to `open()`.
    - Next is the mode in which I opened the file. Here I have opened the file in read mode. For that, I pass `'r'` as the second argument: `open("names.txt", "r")`
    - There are many modes in which you can open a file. For example, to open it in write mode, I can pass `'w'` and be able to write to the file. Here is a StackAbuse article talking about file handling.
- File Methods:
  - After we have opened our file and stored it in a variable called `names`, we have to read all of the names in the text file. To do that, we use the file method `.readlines()`, which returns a list of the lines in `names.txt`: `names_list = names.readlines()`
  - You can learn more about file methods here.
  - Finally, we have to close the file using `.close()`. We do this because there is a limit to how many files can be open at once on your device, and because a file left open keeps using system resources and can cause problems (for example, data written to it may not be saved until it is closed).
- What next?
  - Here is where the problem comes in. Now that we have turned our names into a list, we can work with it, but every element in the list ends with a `\n`, the newline character that marks where each line of the file ends. So instead of the name `Wilbur Whitson`, we get `Wilbur Whitson\n`. You can try this out yourself by printing the list.
  - This will cause a lot of problems. To avoid that, we can use a list comprehension. So instead of using the list with `\n` after every element, we do this:

        names = open('names.txt', 'r')
        names_list = names.readlines()
        # Remove the trailing newline (and surrounding whitespace) from each line
        names_list_fixed = [line.strip() for line in names_list]
        names.close()

  - We strip every line read from `names.txt` and put it into our new list `names_list_fixed`.
  - Now that we have a clean list, we can work with it. Here is how you would write a program that picks the customer who wins our premium membership:

        import random

        names = open('names.txt', 'r')
        names_list = names.readlines()
        names_list_fixed = [line.strip() for line in names_list]
        names.close()

        # Pick one name at random from the cleaned list
        winner = random.choice(names_list_fixed)
        print(f"Congrats {winner}, you won our premium membership.")

  - This is just one of a million things you can do once you have your data sorted out. The only limit is your imagination.
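The same steps can also be written with a `with` block, which closes the file automatically, even if an error happens while reading. The sketch below is one possible variant, not the method used in this post. `load_names` and `most_frequent` are hypothetical helper names, and the sample file with made-up entries is written to a temporary folder so the snippet runs on its own. It also shows one way to answer the "who visited our website the most" question from earlier, using `collections.Counter`.

```python
import random
import tempfile
from collections import Counter
from pathlib import Path


def load_names(path):
    """Return the non-empty lines of a text file with whitespace stripped."""
    # The with block closes the file for us, so no explicit close() is needed.
    with open(path, "r") as f:
        return [line.strip() for line in f if line.strip()]


def most_frequent(names, n=1):
    """Return the n most common names as (name, count) pairs."""
    return Counter(names).most_common(n)


# Made-up sample data so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "names.txt"
    sample.write_text("Wilbur Whitson\nAda Example\nWilbur Whitson\n")
    names_list = load_names(sample)

print(most_frequent(names_list))   # the visitor who appears most often
print(random.choice(names_list))   # a random visitor for the giveaway
```

With your real data, you would call `load_names("names.txt")` instead of creating the sample file.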
CSV Files:
Sometimes we need to store more complex data in a file, with several pieces of information per record rather than just one string per line. For that, we use CSV (comma-separated values) files. Here is an example of what a CSV file looks like:

    app,ceo,date,type
    Youtube,Susan Wojcicki,2005,video sharing
    Facebook,Mark Zuckerburg,2004,social networking
    Google,Sundar Pichai,1998,search engine
    Amazon,Jeff Bezos,1994,e-commerce
    Apple,Tim Cook,1976,tech

The first line in the file holds the column headings: each heading labels the column beneath it. For example, all the data in the first column is the names of companies. The second-to-last column holds the year each company was founded. So how do we work with CSV files using Python?
Working With CSV Files
- Opening
  - First, just as we did with the `.txt` file, we put the data into a file and name it `companies.csv`. You can name it anything you want lol.
  - Next, we open the file with the same method:

        data = open('companies.csv', 'r')
        data_list = data.readlines()
        data.close()

- List Comprehension
  - Once again, we need to `strip()` our data so that we don't have the `\n` after every element. But this time we do not need the first row, which holds the headings, so we simply do this:

        data = open('companies.csv', 'r')
        data_list = data.readlines()
        # data_list[1:] skips the first line, which holds the headings
        data_list_fixed = [line.strip() for line in data_list[1:]]
        data.close()

  - This takes every line in the list except the first one, because that line holds the headings.
- Iterating over each line
  - Now that we have a list of lines, we can iterate over it and split each line on the commas, so that we get a list of fields for each line:

        for line in data_list_fixed:
            data_inv = line.split(",")
            print(data_inv)

  - This gives us these beautiful lists:

        ['Youtube', 'Susan Wojcicki', '2005', 'video sharing']
        ['Facebook', 'Mark Zuckerburg', '2004', 'social networking']
        ['Google', 'Sundar Pichai', '1998', 'search engine']
        ['Amazon', 'Jeff Bezos', '1994', 'e-commerce']
        ['Apple', 'Tim Cook', '1976', 'tech']

  - Finally, we can print out something like this:

        for line in data_list_fixed:
            data_inv = line.split(",")
            # Unpack the four fields in column order
            company = data_inv[0]
            ceo = data_inv[1]
            founded = data_inv[2]
            company_type = data_inv[3]
            print(f"{ceo} is the CEO of {company}, which was founded in {founded}, and specializes in {company_type}\n")

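Splitting on commas works for this file, but it breaks as soon as a field itself contains a comma, such as a quoted company name. Python's standard `csv` module handles quoting and reads the heading row for you. Here is a minimal sketch, assuming the same columns as `companies.csv`. The data is embedded in a string (`CSV_TEXT`, a name chosen for this example) so the snippet runs on its own, and the second row is made up to show a quoted comma.

```python
import csv
import io

# First data row is from the example file; the second row is made up
# to show a field that contains a comma inside quotes.
CSV_TEXT = """app,ceo,date,type
Youtube,Susan Wojcicki,2005,video sharing
"Example, Inc.",Jane Doe,2000,demo
"""

# io.StringIO lets csv read from a string as if it were an open file.
# With a real file you could pass open("companies.csv", newline="") instead.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = list(reader)

for row in rows:
    # Each row is a dict keyed by the headings from the first line.
    print(f"{row['ceo']} is the CEO of {row['app']}, founded in {row['date']}")
```

Looking fields up by heading name (`row['ceo']`) instead of by position (`data_inv[1]`) also keeps the code working if the columns are ever reordered.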
Conclusion.
In this tutorial, I have shown how to read and work with data from CSV and `.txt` files. You can do pretty much anything with your data once you know how to sort it out. Here is a very popular CSV file about video game sales.
Outro
This took me a long time to make so I would love and appreciate an upvote.
Credits:
This is very interesting @OldWizard209. I am learning Python from w3schools and Coursera and practicing on PyCharm. Can I fork this repl and work on it? And could you please give me some links where I can learn more? Thanks
Sure, fork it, or you can just copy and paste the code into PyCharm.
There are many tutorials online:
On YouTube there is a full Python course. Go to the machine learning section, where the instructor explains ML in great detail. @IronStarkMan
There are some free MIT courses about Machine Learning as well. Search it up.... Ping me if you need more places where u can learn....
markdown error?
And good you cited your sources lol I haven't seen that much! :D
Fixed the error. I cited the sources just to make my post look better LMAO @Bookie0
@OldWizard209 lol it's always good to have sources (to make the post good or not ;)
True @Bookie0
mhm
@OldWizard209