
📊 Handling Large Amounts of Data In Python 📊
OldWizard209 (1104)

Introduction

Data scientist has become one of the most in-demand and exciting jobs of the 21st century, and Python is widely considered one of the best programming languages for machine learning.


So, what point am I making here?

When working with AI and machine learning, we need to handle huge amounts of data. When we give a computer data, it can find patterns in that data and use them to help us. Everything from YouTube to Facebook uses machine learning to display ads and video recommendations. For this reason, today I am going to show you a few ways to handle data.

Lists.

Let's say we have 1000 names of people who have visited our website in the past week (you can find the names linked here). We need to turn those 1000 names into a list so that we can find patterns in them: for example, who visited our website the most. Or maybe we want to pick a random user and gift them a premium membership for a year.
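To make "finding patterns" a bit more concrete, here is a rough sketch of how you might find the most frequent visitor once the names are in a list. The short visitors list and the most_frequent_visitor helper are made up for illustration; the real list would hold all 1000 names.

```python
from collections import Counter

# Hypothetical sample data; the real list would hold all 1000 names.
visitors = [
    "Wilbur Whitson", "Ada Park", "Wilbur Whitson",
    "Sam Lee", "Ada Park", "Wilbur Whitson",
]

def most_frequent_visitor(names):
    """Return (name, visit_count) for the name that appears most often."""
    counts = Counter(names)                 # maps each name to how often it appears
    name, count = counts.most_common(1)[0]  # the single most common entry
    return name, count

top_name, top_count = most_frequent_visitor(visitors)
print(f"{top_name} visited {top_count} times")
```

Counter is part of Python's standard library, so nothing extra needs to be installed.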


How it works?

Typing every single name into a list by hand would waste our time and effort. There are many ways to build the list in our main.py file. For example, one way would be to import the re module and do this:

import re
names = re.split(r'\s+', '<Put all the names here>')
print(names)

But pasting 1000 names into main.py will mess up our code real bad.
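There is one more catch with the re approach, assuming the names are full names stored one per line (like Wilbur Whitson later in this tutorial): splitting on any whitespace also splits first and last names apart. A small sketch with a made-up two-name sample shows the difference:

```python
import re

# Hypothetical two-line sample standing in for the pasted names.
raw = "Wilbur Whitson\nAda Park\n"

# Splitting on any run of whitespace also breaks first and last names apart.
by_whitespace = re.split(r"\s+", raw.strip())

# Splitting on line breaks keeps each full name together.
by_line = raw.splitlines()

print(by_whitespace)
print(by_line)
```

With this sample, the first list has four entries and the second has two, so splitting by line is the safer choice for full names.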

Solution:

  • Firstly, we can make a new file called names.txt and put all our names in there instead. Remember to put .txt at the end of the file name, as it denotes a text file, just like .py denotes a Python file.

  • Next, we will go back to our main.py and we will now start writing some code.


  • Opening the file:
    • We need to first open the names.txt file so that we can read the contents in it.
    • For that, we need to use a built-in Python function called open():
      names = open("names.txt", "r")
      Woah, Woah, Woah... Relax buddy, what did I do there?
    • Arguments:
      • Well, when you use open(), you need to specify which file you are opening. To do that, we pass the file name as the first argument: open("<file name>").
      • Next is the mode in which I opened the file. Here I have opened the file in read mode, so I pass 'r' as the second argument: open("names.txt", "r")
      • There are many modes in which you can open a file. For example, 'w' opens it in write mode so that you can write to the file. Be careful: 'w' erases whatever the file already contained. Here is a StackAbuse article talking about file handling.

  • File Methods:
    • After we have opened our file and stored it in a variable called names, we have to read all of the names in the text file. To do that, we use the file method .readlines(), which returns a list of the lines in names.txt: names_list = names.readlines()
    • You can learn more about file methods here.
    • Finally, we have to close the file using .close(). We do this because there is a limit to how many files can be open at once on your device, and because a file left open keeps holding system resources.

  • What next?

    • Here is where the problem comes in. Now that we have turned our names into a list, we can work with it, but .readlines() keeps the newline character \n at the end of each line. So instead of the name Wilbur Whitson, we get Wilbur Whitson\n. You can try this out yourself by printing the list.

    • This will cause a lot of problems. To avoid that, we can use a list comprehension. So instead of using the list with \n after every element, we do this:

      names = open('names.txt', 'r')
      names_list = names.readlines()
      names_list_fixed = [line.strip() for line in names_list]
      names.close()
    • We strip every element in names_list and put it into our new list, names_list_fixed.
    • Now that we have a clean list, we can work with it. Here is how you would write a program for our premium membership winning customer:

      import random
      
      names = open('names.txt', 'r')
      names_list = names.readlines()
      names_list_fixed = [line.strip() for line in names_list]
      names.close()
      
      winner = random.choice(names_list_fixed)
      print(f"Congrats {winner}, you won our premium membership.")
    • This is just one of the million things you can do once you have your data sorted. The only limit is your imagination.
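As an alternative to calling .close() yourself, the open, read and close steps above can also be written with a with block, which closes the file automatically even if an error happens halfway through. The sketch below is just one way to do it: the load_names helper and the throwaway sample file are my own additions for the demo, and skipping blank lines is an extra assumption that the code above does not make.

```python
import random
import tempfile
from pathlib import Path

def load_names(path):
    """Read one name per line, stripping whitespace and skipping blank lines."""
    # The with block closes the file automatically, even if an error occurs.
    with open(path, "r") as names_file:
        return [line.strip() for line in names_file if line.strip()]

# Demo setup: a throwaway names.txt with hypothetical sample names.
demo_file = Path(tempfile.mkdtemp()) / "names.txt"
demo_file.write_text("Wilbur Whitson\nAda Park\nSam Lee\n")

names_list = load_names(demo_file)
winner = random.choice(names_list)
print(f"Congrats {winner}, you won our premium membership.")
```

In your own repl you would skip the demo setup and just call load_names("names.txt").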


CSV Files:

Sometimes we need to store more complex data in a file than just a list of strings. For that, we use CSV (comma-separated values) files. Here is an example of what a CSV file looks like:

app,ceo,date,type
Youtube,Susan Wojcicki,2005,video sharing
Facebook,Mark Zuckerburg,2004,social networking
Google,Sundar Pichai,1998,search engine
Amazon,Jeff Bezos,1994,e-commerce
Apple,Tim Cook,1976,tech

The first line in the file holds the headings, and each heading names the column beneath it. For example, the first column holds the names of the companies, and the second-to-last column holds their founding dates. So how do we work with CSV files using Python?

Working With CSV Files

  • Opening
    • Firstly, we do the same thing we did with the .txt file. We put the data into a file and name it companies.csv. You can name it anything you want lol.
    • Next, we open the file with the same method:
      data = open('companies.csv', 'r')
      data_list = data.readlines()
      data.close()
  • List Comprehension
    • Once again, we need to strip() our data so that we don't have the \n after every element. But this time we do not need the first line, so we simply do this:
      data = open('companies.csv', 'r')
      data_list = data.readlines()
      data_list_fixed = [line.strip() for line in data_list[1:]]
      data.close()
    • This takes every element in the list except the first one, because that line holds the headings.
  • Iterating over each line

    • Now that we have a list, we can iterate over it and split each line on the commas, so that we get a list of values for each line:
      for line in data_list_fixed:
          data_inv = line.split(",")
          print(data_inv)
    • This gives us these beautiful lists:

      ['Youtube', 'Susan Wojcicki', '2005', 'video sharing']
      ['Facebook', 'Mark Zuckerburg', '2004', 'social networking']
      ['Google', 'Sundar Pichai', '1998', 'search engine']
      ['Amazon', 'Jeff Bezos', '1994', 'e-commerce']
      ['Apple', 'Tim Cook', '1976', 'tech']
    • Finally, we can print out something like this:

      for line in data_list_fixed:
          data_inv = line.split(",")
          company = data_inv[0]
          ceo = data_inv[1]
          founded = data_inv[2]
          company_type = data_inv[3]
          print(f"{ceo} is the CEO of {company}, which was founded in {founded}, and specializes in {company_type}\n")
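Splitting on commas by hand works for this file, but it would break if a value ever contained a comma itself. Python's built-in csv module handles that case, and its DictReader uses the heading line as keys so you can look values up by column name. Here is a minimal sketch, assuming the same headings as above; the load_companies helper and the temporary demo file are made up for the example.

```python
import csv
import tempfile
from pathlib import Path

# Demo setup: a small companies.csv with two rows from the example above.
demo_file = Path(tempfile.mkdtemp()) / "companies.csv"
demo_file.write_text(
    "app,ceo,date,type\n"
    "Google,Sundar Pichai,1998,search engine\n"
    "Amazon,Jeff Bezos,1994,e-commerce\n"
)

def load_companies(path):
    """Return each data row as a dict keyed by the heading line."""
    # The csv module docs recommend newline="" when opening CSV files.
    with open(path, "r", newline="") as csv_file:
        return list(csv.DictReader(csv_file))

companies = load_companies(demo_file)
for row in companies:
    print(f"{row['ceo']} is the CEO of {row['app']}, which was founded in {row['date']}.")
```

Note that every value still comes back as a string, so you would need int(row['date']) before doing any math with the year.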

Conclusion.

In this tutorial, I have shown how to manage data from CSV and .txt files. You can pretty much do anything when you know how to sort them out. Here is a very popular CSV File about Video Game Sales.

Outro

This took me a long time to make so I would love and appreciate an upvote.

Credits:

Comments
IronStarkMan (11)

This is very interesting @OldWizard209. I am learning Python from w3schools and Coursera and practicing on PyCharm. Can I fork this repl and work on it? And can you please provide me with some links where I can learn more about this? Thanks

OldWizard209 (1104)

Sure, fork it, or you can just copy and paste the code into PyCharm.
There are many tutorials online:
On YouTube there is a full Python course. Go to the machine learning section, where the person explains ML in great detail. @IronStarkMan
There are some free MIT courses about machine learning as well. Search it up... Ping me if you need more places where you can learn.

Bookie0 (5937)

##How it works?

markdown error?

And good you cited your sources lol I haven't seen that much! :D

OldWizard209 (1104)

Fixed the error. I cited the sources just to make my post look better LMAO @Bookie0

Bookie0 (5937)

@OldWizard209 lol it's always good to have sources (to make the post good or not ;)