Gibberish Detection
RYANTADIPARTHI (6011)

Hi,

So, I'm making something, and I want to have a gibberish detector in it. That is, if someone tries to post gibberish, it returns True or shows a message saying the input is gibberish; otherwise, it says it's not gibberish.

If someone knows how to do that, please let me know. This is in python. Also, it should work in repl.it

Thanks!

Answered by KENNETHTRIPP (50) [earned 5 cycles]
KENNETHTRIPP (50)

There are certain libraries in Python that include a whole dictionary. You could go through the input and check whether each word is in the dictionary. This, however, is a less than ideal solution because there are many "slang" terms that it doesn't recognize. You can import one such library with import enchant

RYANTADIPARTHI (6011)

@KENNETHTRIPP could you show me a code example using that library? Thanks!

KENNETHTRIPP (50)

@RYANTADIPARTHI I made a jumble solver here: https://repl.it/talk/share/Jumble-Solver/86506 . Basically, there is a function that returns True or False depending on whether the input is a word.

RYANTADIPARTHI (6011)

@KENNETHTRIPP doesn't help. Each time it says killed.

KENNETHTRIPP (50)

@RYANTADIPARTHI On my example, or when you try to use it? Can you invite me to your project? Maybe I could help.

RYANTADIPARTHI (6011)

@ch1ck3n before I asked this question, I already tried that link. It doesn't really work. i mean it works, but not according to my needs. It's confusing what it does.

ch1ck3n (1632)

@RYANTADIPARTHI Copy that code, and classify("foo") returns a float. the higher the float is, the more likely it's gibberish. >0.5 usually means it's gibberish.

RYANTADIPARTHI (6011)

@ch1ck3n could you show me an example?

rediar (499)

@RYANTADIPARTHI The comment he gave you is the example....
Float = decimal
Higher decimal = higher percentage it is gibberish

catspython (27)

I realized that once an Ask question is answered, it doesn't get many more comments. So I decided to comment on an old, answered question to give a random notification and annoy them or something. Or you, @RYANTADIPARTHI.

tussiez (1532)

@RYANTADIPARTHI You can use this to check the supposed gibberish against some common English words

RYANTADIPARTHI (6011)

@tussiez well, that doesn't help. I need to check all words.

tussiez (1532)

@RYANTADIPARTHI Hmm. I'm not sure how exactly a computer would recognize something as a word though, even if it were able to detect sounds/syllables/etc as this could be bypassed with random combinations of vowels/consonants

tussiez (1532)

@tussiez You may be able to query a dictionary online, but the server might get mad and block you...

RYANTADIPARTHI (6011)

@tussiez well, the person i marked came up with the excellent solution. we both worked together on it. so, no worries.

RYANTADIPARTHI (6011)

@tussiez yeah.. but like i said, i have a solution.

rediar (499)

Simple: Use a dictionary API
Slang and names will be classified as gibberish then. So you should only classify a message as gibberish if more than half of the message's words are gibberish.
Additionally, you can flag multiple consonants in a row as gibberish, as well as weird capitalization, numbers and symbols inside words, and unusually long words.
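
The per-word checks listed above could be sketched roughly like this. It is only an illustration: the function names and the thresholds (max_run, max_len) are made up for this example and would need tuning, and a few real words (for example brand names like "iPhone") will still trip the capitalization check.

```python
import re

VOWELS = set("aeiouy")  # treating "y" as a vowel is an assumption

def longest_consonant_run(word):
    """Length of the longest run of consecutive consonant letters."""
    longest = current = 0
    for ch in word.lower():
        if ch.isalpha() and ch not in VOWELS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

def looks_suspicious(word, max_run=5, max_len=20):
    """Return True if the word trips any of the heuristics (thresholds are guesses)."""
    if len(word) > max_len:
        return True
    if longest_consonant_run(word) > max_run:
        return True
    # numbers or symbols sandwiched between letters, e.g. "he7lo"
    # (apostrophes and hyphens are allowed so "don't" and "well-known" pass)
    if re.search(r"[A-Za-z][^A-Za-z'\-]+[A-Za-z]", word):
        return True
    # weird capitalization: not all lowercase, all uppercase, or Capitalized
    letters = re.sub(r"[^A-Za-z]", "", word)
    if letters and not (letters.islower() or letters.isupper() or letters.istitle()):
        return True
    return False
```

For example, looks_suspicious("kjrnfgrkwiekrjnk") is True because of the nine-consonant run, while looks_suspicious("strengths") is False.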

RYANTADIPARTHI (6011)

@rediar the problem with that is, it doesn't classify sentences. Long sentences.

rediar (499)

@RYANTADIPARTHI iterate through a sentence by splitting it into words, and give it a score based on the amount of gibberish in the words
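
A minimal sketch of that scoring idea, assuming some word checker is available. Here KNOWN_WORDS is a tiny stand-in list invented for the example; in practice is_word would call a real dictionary (such as enchant's Dict.check) instead. The 0.5 threshold follows the "more than half" rule suggested above.

```python
# Stand-in word list for illustration only; swap in a real dictionary lookup.
KNOWN_WORDS = {"this", "is", "a", "sentence", "hello", "world", "test"}

def is_word(word):
    """Hypothetical checker: True if the word is in the stand-in list."""
    return word.lower() in KNOWN_WORDS

def gibberish_score(sentence):
    """Fraction of words in the sentence that are not recognized (0.0 to 1.0)."""
    words = sentence.split()
    if not words:
        return 0.0
    unknown = sum(1 for w in words if not is_word(w))
    return unknown / len(words)

def is_gibberish(sentence, threshold=0.5):
    """True when more than `threshold` of the words are unrecognized."""
    return gibberish_score(sentence) > threshold
```

With this, one unknown word in a three-word message is tolerated, but a message made up entirely of unknown words is flagged.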

RYANTADIPARTHI (6011)

@rediar yeah, np. someone already solved it anyways.

ch1ck3n (1632)

@Coder100 use the flask unblocker thingy

ch1ck3n (1632)

@ch1ck3n what the heck double comments

RYANTADIPARTHI (6011)

@Coder100 can't. None of the unblockers work here.

rediar (499)

@RYANTADIPARTHI well, you could always use a repl+selenium to avoid blocks...

Bookie0 (6031)

Solution

I've found this for python, use this package. Check it out here

And check this StackOverflow page as well!

Like that.

Thanks!

That should work

RYANTADIPARTHI (6011)

@Bookie0 nope, not a solution.

I tried those before asking this question. The first one doesn't seem to be available on repl.it, and the second link doesn't work either.

let me know if you have any others. thanks!

Bookie0 (6031)

@RYANTADIPARTHI

oh, hum. Maybe try to install a python IDE on your computer, like IDLE, and try it there. If it works there, then yea, that probably means it just doesn't work on repl.it.
Click here to download python.

Also, like @realTronsi said, gibberish is kinda subjective. Otherwise, I would go with @Baconman321's idea: find something that can check whether the user's word is an existing English word, like this.

Or, using this dictionary module, you could check if what the user wrote exists, since that module should contain every English word.

Or, using the Markov Chain and this github page, this is what it could do:

Go check it out for more details! :D

RYANTADIPARTHI (6011)

@Bookie0 ok. so, could you show some code examples and all for pydictionary, and the last github link. I didn't really understand how to use them properly.

but it's a bit helpful though.

Bookie0 (6031)

@RYANTADIPARTHI For the Pydictionary, I'm afraid there isn't any code or anything to find if a word is in a dict, it was just a suggestion I made up myself.

However, check this:

>>> import enchant 
>>> d = enchant.Dict("en_US") 
>>> d.check("Hello") 
True 
>>> d.check("Helo") 
False 

According to that site above, this is how to use it.


And also check this out:

You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).

If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.
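
Here is a rough sketch of the character-transition idea described in that quote. It is not the code from the linked GitHub page; the function names, the tiny SAMPLE corpus, and the smoothing value are all assumptions made for illustration. It averages log-probabilities instead of multiplying raw probabilities, which is the same "product normalized by length" idea but avoids very small numbers.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def normalize(text):
    """Lowercase the text and keep only letters and spaces."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)

def train(corpus, smoothing=0.01):
    """Build a table of log P(next char | current char) from sample text."""
    counts = defaultdict(lambda: defaultdict(float))
    text = normalize(corpus)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    log_probs = {}
    for a in ALPHABET:
        # additive smoothing so unseen transitions get a small nonzero probability
        total = sum(counts[a].values()) + smoothing * len(ALPHABET)
        log_probs[a] = {
            b: math.log((counts[a][b] + smoothing) / total) for b in ALPHABET
        }
    return log_probs

def avg_log_prob(text, log_probs):
    """Average log-probability per transition; lower means more gibberish-like."""
    text = normalize(text)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0  # nothing to score
    return sum(log_probs[a][b] for a, b in pairs) / len(pairs)

# Tiny made-up corpus; a real model needs far more English text than this.
SAMPLE = (
    "this is a sentence and this is another sentence "
    "the quick brown fox jumps over the lazy dog "
    "she sells sea shells by the sea shore"
)
model = train(SAMPLE)
```

To turn the score into True/False, you would compare avg_log_prob(text, model) against a cutoff chosen by scoring some known-good and known-gibberish examples; no particular cutoff is assumed here.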


For the github thing, check the files out. There's some code in them (I think this file and this one too!)
(From here)

Good luck!

RYANTADIPARTHI (6011)

@Bookie0 the dictionary works. But the thing is, it doesn't work for sentences.

like

this is a sentence

if you tell me how to do the same thing for sentences, then my question is answered.

also, the other links give errors.

Bookie0 (6031)

@RYANTADIPARTHI hum, you could maybe get the length of the user's input. Or like, separate each word in the user's input into a list, then use a for loop to loop over each word, and then check it. Perhaps like this:

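import enchant  # provided by the pyenchant package

user_input = input("> ")  # ask user to enter something

def convert(string):  # convert string to list
    return string.split()  # splits string into a list of words

li = convert(user_input)  # keep the list that convert() returns

d = enchant.Dict("en_US")  # create the dictionary once, outside the loop

for i in li:  # for loop to go through list
    print(i, d.check(i))  # prints True or False for each word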

I didn't get to test this program as I'm in a little hurry now, so not 100% sure it works, but try to test it and see if it works.

Hope it does, and good luck! :D

RYANTADIPARTHI (6011)

@Bookie0 here's the repl. Few errors. I'll invite you.

but, I'm not going to be here for 12 hours or something. It's night time at my place.

so, if you get the invite, you can work there.

Thanks! comment on here if you finished. I'll check it later.

link to repl.

https://repl.it/@RYANTADIPARTHI/TragicIndigoSites#main.py

Bookie0 (6031)

What language? And by gibberish do you also mean spam?

RYANTADIPARTHI (6011)

@Bookie0 ...

uh, didn't you read my description? I already said python.

also, by gibberish, i meant like in a textbox, if someone put

kjrnfgrkwiekrjnk.

i want a detector that tells, True. Or that is gibberish.

something like that.

Bookie0 (6031)

@RYANTADIPARTHI shoot sorry about that, my bad, skipped over it! ;)
First link I sent in my second comment should have that True/False detection

RYANTADIPARTHI (6011)

@HackingGo306 ok, this is much like something i want. It detects real words, and gibberish. But, one thing is, it doesn't get sentences.

like:

this is a sentence

it says not a word.

is there a way you can find out if sentences are gibberish or not?

but this is good though.

FlaminHotValdez (442)

@RYANTADIPARTHI since you're using python, you can use .split()

FlaminHotValdez (442)

@RYANTADIPARTHI

test = "this is a test string"
words = test.split()
for i in words:
  print(i)

would output

this
is
a
test
string

With no arguments, split() splits a string on whitespace (spaces, tabs, and newlines).

HackingGo306 (15)

@FlaminHotValdez
Yes.
Using split(), I made this:
https://repl.it/@HackingGo306/sentence-or-not#main.py

However, it will only check if the words in the sentence are correct, not if the sentence itself is real or not.

FlaminHotValdez (442)

@HackingGo306 well if we want grammar, we're gonna need an ML algorithm.

RYANTADIPARTHI (6011)

@FlaminHotValdez yes, ik. But I want this using the dictionary example above, the one given in the first comment. Could you use it with that?

RYANTADIPARTHI (6011)

@HackingGo306 ok. This is what I want. It works 97%. But one problem is, for long correct sentences, it says not valid.

could you fix that?

FlaminHotValdez (442)

@RYANTADIPARTHI Basically, when you split a string it returns a list of all the blocks of text in the string that are separated by spaces. So you can iterate over each element in the list that it returns and see if it is valid.

RYANTADIPARTHI (6011)

@FlaminHotValdez yeah, so the solution from the first comment works 97%. The only problem is, for huge sentences, it says invalid input.

any idea how to fix it?

HackingGo306 (15)

@RYANTADIPARTHI
I did some debugging and found out some words don't work.
such as 'hehe' and 'haha' which I can't do anything about.🤷‍♂️

RYANTADIPARTHI (6011)

@HackingGo306 yes, no problem about that. But did you get long sentences?

HackingGo306 (15)

@RYANTADIPARTHI
Hmm. I can't seem to find any that don't work. Perhaps you can give me an example?

RYANTADIPARTHI (6011)

@HackingGo306 like this.

this is a really long sentences, this is a long list? why, really great, master.

RYANTADIPARTHI (6011)

@HackingGo306 idk. But, someone already solved it anyways.

thanks for trying.
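
One possible cause of the long-sentence failure, assuming the checker looks up each space-separated token unchanged: tokens such as "sentences," and "list?" still carry punctuation, so a dictionary lookup rejects them. HackingGo306's repl isn't shown here, so this is a guess. A sketch of stripping punctuation first, with KNOWN_WORDS as a stand-in word list invented for the example:

```python
import string

# Stand-in word list for illustration only; swap in a real dictionary lookup.
KNOWN_WORDS = {
    "this", "is", "a", "really", "long", "sentences",
    "list", "why", "great", "master",
}

def is_word(word):
    """Hypothetical checker: True if the word is in the stand-in list."""
    return word.lower() in KNOWN_WORDS

def clean_words(sentence):
    """Split on whitespace and strip punctuation from both ends of each token."""
    words = (token.strip(string.punctuation) for token in sentence.split())
    return [w for w in words if w]  # drop tokens that were only punctuation

def all_words_valid(sentence):
    """True if the sentence has at least one word and every word is recognized."""
    words = clean_words(sentence)
    return bool(words) and all(is_word(w) for w in words)
```

Only leading and trailing punctuation is removed, so apostrophes inside words like "don't" are left alone.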

realTronsi (913)

Not really possible, it's purely subjective. People could post links, have weird names, or reference words in another language. If you're looking for spam protection, it's better to have humans moderate, but in the case it's a chatroom or forum, "gibberish" doesn't really harm anyone.

RYANTADIPARTHI (6011)

@realTronsi actually, i already searched this up, and there are modules that work a little bit. But some are not available on repl.it

realTronsi (913)

@RYANTADIPARTHI well those modules are bs. Unless you create a neural network and train it to understand English (which even then won't be foolproof), there are just too many factors:

  • context
  • names
  • links
  • symbols (emojis, ascii art, unicode symbols, etc.)
  • code (what if someone just wants to share some code)

... and much more

RYANTADIPARTHI (6011)

@realTronsi just for a project i'm making.

Baconman321 (1061)

To do that, you would have to somehow train the computer about the english language. If not, you might accidentally censor a real word. Otherwise, you could check if the word has a vowel since nearly all english words have a vowel in them. This is probably very weak (someone could type in kjafljfhusiuaysfujh and since it has an a it would think of it as a real word), but it could filter out a little bit of gibberish.
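
The vowel check could be as small as the sketch below (has_vowel is just a name picked for this example, and counting "y" as a vowel is an assumption). As noted above, it only catches vowel-free strings.

```python
VOWELS = set("aeiouy")  # counting "y" as a vowel is an assumption

def has_vowel(word):
    """True if the word contains at least one vowel letter."""
    return any(ch in VOWELS for ch in word.lower())
```

So has_vowel("kjrnfgrk") is False, but has_vowel("kjafljfhusiuaysfujh") is True, which is exactly the weakness described.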

Otherwise, yeah you would have to train it the english language.

What you could do is use an english word checker api (like WordsAPI) which would check if the word exists and if not it would say it's gibberish (this might censor out names and abbreviations though). Gibberish isn't really "defined" in computer context, as it's more "opinion based". What is gibberish and what is not is determined by how we comprehend English.

Unless you want to join MIT (or some other advanced place) and work on a machine learning algorithm for the English language, I don't think your detector is going to be very efficient.

RYANTADIPARTHI (6011)

@Baconman321 this really didn't help. You just told me what's good and what's not. The API doesn't work much. I searched Google before asking, and there were modules and all, but some were not available on repl.it, and some weren't working properly.

So if you have a module, or a good algorithm that works, please let me know.

Baconman321 (1061)

@RYANTADIPARTHI I'm giving you options.

Don't expect me to write code for something this complicated.

The best option I would go for is an English word/name checker like the API I gave you (if it doesn't work then use a different one).

All you have to do is check if the word/s someone entered match up with valid words/names.

If so, then return false or give a message that it's not gibberish.

Otherwise return true or a message saying it's gibberish.

Machine learning algorithms are time costly (especially one of this type) and often complicated.

I recommend just using a word checker API and seeing if the post contains real words.

Maybe have a "gibberish" word count and if it goes over a limit then just say that the user posted gibberish as well?

RYANTADIPARTHI (6011)

@Baconman321 ok. At least a module? that makes things easier.

Baconman321 (1061)

@RYANTADIPARTHI I don't know of a module. A word checker API would be easier, you just send a request to it and wait for a response. Depending on the API you would check if the "word" the user put in is a real one or not.

However, you might want https://pypi.org/project/gibberish-detector/

I don't know if you can install it though, you probably might be able to...

RYANTADIPARTHI (6011)

@Baconman321 yes, i tried that. I don't think it works here on repl.it properly. Try it yourself, and see.

Baconman321 (1061)

@RYANTADIPARTHI I don't know python (some but barely any).

From what I see, you would have to install pip (maybe ask a question about that).

Otherwise I don't know...

Baconman321 (1061)

@RYANTADIPARTHI Oh. It's the package itself that doesn't work?

I don't know then...

:(