So, I'm making something, and I want to have a gibberish detector in it. Meaning, if someone tries to post gibberish, it returns True or gives a message saying the input is gibberish; otherwise, it says it's not gibberish.
If someone knows how to do that, please let me know. This is in Python. Also, it should work on repl.it.
There are certain libraries in Python that contain a whole dictionary. You could go through the input and see if each of the words is in the dictionary. This, however, is a less than ideal solution because there are many "slang" terms that it doesn't recognize. You can import the library with
Simple: use a dictionary API.
Slang and names will be classified as gibberish then. So you should only classify a message as gibberish if more than half of the message's words are gibberish.
Additionally, you can flag multiple consonants in a row as gibberish, as well as weird capitalization, numbers and symbols inside words, and unusually long words.
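Here's a rough sketch of those heuristics combined with the "more than half" rule. The function names and the thresholds (`max_consonant_run=4`, `max_length=20`) are made up for illustration and not tuned on real data, so treat them as assumptions.

```python
import re

VOWELS = set("aeiou")

def looks_like_gibberish_word(word, max_consonant_run=4, max_length=20):
    """Hypothetical per-word check; the thresholds are guesses, not tuned values."""
    core = word.strip(".,!?;:'\"()")  # ignore punctuation around the word
    if not core:
        return False  # nothing left to judge
    if len(core) > max_length:
        return True  # unusually long word
    if re.search(r"[A-Za-z][^A-Za-z'\-]+[A-Za-z]", core):
        return True  # digits or symbols between letters
    if re.search(r"[a-z][A-Z]", core):
        return True  # capital letter right after a lowercase one
    run = 0
    for ch in core.lower():
        if ch.isalpha() and ch not in VOWELS:
            run += 1  # extend the current consonant run
            if run > max_consonant_run:
                return True
        else:
            run = 0  # a vowel or non-letter ends the run
    return False

def is_gibberish(message):
    """Flag the message only if more than half of its words look like gibberish."""
    words = message.split()
    if not words:
        return False
    flagged = sum(looks_like_gibberish_word(w) for w in words)
    return flagged > len(words) / 2

print(is_gibberish("Hello there, how are you?"))
print(is_gibberish("kjsdfhg qwrtpz x9k"))
```

Real words like "strengths", names like "McDonald", and links will still trip the per-word check, which is exactly why the majority rule is there.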
Oh, hum. Maybe try to install a Python IDE on your computer, like IDLE, and try it there. If it works there, then yeah, that probably means it just doesn't work on repl.it.
Click here to download Python.
Also, like @realTronsi said, gibberish is kinda subjective. Otherwise, I would go with @Baconman321's idea: find something that can check whether the user's word is an existing English word, like this.
Or, using this dictionary module, you could check if what the user wrote exists, since that module should contain most English words.
Or, using a Markov chain and this GitHub page, this is what it could do:
Go check it out for more details! :D
@RYANTADIPARTHI For PyDictionary, I'm afraid there isn't any code or anything that checks if a word is in the dictionary; it was just a suggestion I came up with myself.
However, check this:
    >>> import enchant
    >>> d = enchant.Dict("en_US")
    >>> d.check("Hello")
    True
    >>> d.check("Helo")
    False
According to that site above, this is how to use it.
And also check this out:
You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).
If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.
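A minimal sketch of that character-transition idea, assuming a tiny made-up training text (a real model needs far more English text than this). The names `train` and `average_log_prob` are hypothetical. Instead of multiplying the probabilities and normalizing by length, it averages their logarithms, which is the same comparison but avoids numbers underflowing to zero.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

# Tiny stand-in corpus (an assumption); use a large body of English text in practice.
SAMPLE_TEXT = (
    "the quick brown fox jumps over the lazy dog. this is a sample of "
    "ordinary english text that is used to count how often one letter "
    "follows another. the weather is nice today and the people are "
    "walking in the park."
)

def normalize(text):
    """Lowercase the text and keep only letters and spaces."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)

def train(corpus):
    """Build a table of log-probabilities for each character-to-character transition."""
    # Start every count at 1 so unseen transitions never get probability zero.
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    text = normalize(corpus)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    log_probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        log_probs[a] = {b: math.log(c / total) for b, c in row.items()}
    return log_probs

def average_log_prob(log_probs, query):
    """Average log-probability per transition; lower means more gibberish-like."""
    text = normalize(query)
    if len(text) < 2:
        return 0.0  # too short to have any transition
    total = sum(log_probs[a][b] for a, b in zip(text, text[1:]))
    return total / (len(text) - 1)

model = train(SAMPLE_TEXT)
english_score = average_log_prob(model, "the weather is nice today")
gibberish_score = average_log_prob(model, "zxqj kvwq pfgz")
print(english_score, gibberish_score)
```

To turn the score into a yes/no answer you would still have to pick a cutoff, for example somewhere between the scores of known-good and known-bad samples.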
@RYANTADIPARTHI Hum, you could maybe get the length of the user's input. Or, like, separate each word in the user's input into a list, then use a for loop to loop over each word and check it. Perhaps like this:
    import enchant  # PyEnchant; it has to be installed first

    user_input = input("> ")  # ask user to enter something

    def convert(string):
        # split the string into a list of words
        return string.split()

    d = enchant.Dict("en_US")  # create the dictionary once, outside the loop
    for word in convert(user_input):  # for loop to go through the list
        print(word, d.check(word))  # True if the word is in the dictionary
I didn't get to test this program as I'm in a little hurry now, so not 100% sure it works, but try to test it and see if it works.
Hope it does, and good luck! :D
@Bookie0 here's the repl. Few errors. I'll invite you.
but, I'm not going to be here for 12 hours or something. It's night time at my place.
so, if you get the invite, you can work there.
Thanks! comment on here if you finished. I'll check it later.
link to repl.
Using split(), I made this:
However, it will only check if the words in the sentence are correct, not if the sentence itself is real or not.
Not really possible, it's purely subjective. People could post links, have weird names, or reference words in another language. If you're looking for spam protection, it's better to have humans moderate, but in case it's a chatroom or forum, "gibberish" doesn't really harm anyone.
@RYANTADIPARTHI Well, those modules are bs. Unless you create a neural network and train it to understand English (which even then won't be foolproof), there are just too many factors:
- symbols (emojis, ascii art, unicode symbols, etc.)
- code (what if someone just wants to share some code)
... and much more
To do that, you would have to somehow teach the computer the English language. If not, you might accidentally censor a real word. Otherwise, you could check if the word has a vowel, since nearly all English words have a vowel in them. This is probably very weak (someone could type in kjafljfhusiuaysfujh and, since it has an "a", it would count as a real word), but it could filter out a little bit of gibberish.
Otherwise, yeah, you would have to train it on the English language.
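For what it's worth, the vowel check is only a few lines. This is just a sketch: `has_vowel` is a made-up name, and counting "y" as a vowel is my own assumption so that words like "rhythm" pass.

```python
VOWELS = set("aeiouy")  # treating "y" as a vowel is an assumption

def has_vowel(word):
    """Return True if the word contains at least one vowel."""
    return any(ch in VOWELS for ch in word.lower())

message = "hello zxcvb kjafljfhusiuaysfujh"
no_vowel_words = [w for w in message.split() if not has_vowel(w)]
print(no_vowel_words)
```

As said above, kjafljfhusiuaysfujh still slips through because it contains vowels, so this only catches the most obvious keyboard mashing.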
What you could do is use an English word checker API (like WordsAPI), which would check if the word exists, and if not, you would say it's gibberish (this might censor names and abbreviations though). Gibberish isn't really "defined" in a computing context, as it's more "opinion based". What is gibberish and what is not is determined by how we comprehend English.
Unless you want to join MIT (or some other advanced place) and work on a machine learning algorithm for the English language, I don't think your detector is going to be very accurate.
@Baconman321 This really didn't help. You just told me what's good and what's not. The API doesn't work much. I searched Google before asking, and there were modules and all, but some were not available on repl.it, and some weren't working properly.
So if you have a module, or a good algorithm that works, please let me know.
@RYANTADIPARTHI I'm giving you options.
Don't expect me to write code for something this complicated.
The best option I would go for is an English word/name checker like the API I gave you (if it doesn't work, then use a different one).
All you have to do is check if the word/s someone entered match up with valid words/names.
If so, then return false or give a message that it's not gibberish.
Otherwise return true or a message saying it's gibberish.
Machine learning algorithms are time costly (especially one of this type) and often complicated.
I recommend just using a word checker API and seeing if the post contains real words.
Maybe have a "gibberish" word count and if it goes over a limit then just say that the user posted gibberish as well?
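Something like this, as a rough sketch. `KNOWN_WORDS` is a tiny stand-in I made up; in practice the lookup would go through whatever word checker API or dictionary module you end up using. The function names and the default `limit=2` are assumptions too.

```python
import string

# Tiny stand-in word list (an assumption); replace with a real dictionary or API lookup.
KNOWN_WORDS = {"hello", "how", "are", "you", "today", "i", "am", "fine", "thanks"}

def count_unknown_words(message, known_words=KNOWN_WORDS):
    """Count the words in the message that are not in the known-word set."""
    count = 0
    for raw in message.split():
        word = raw.strip(string.punctuation).lower()  # drop surrounding punctuation
        if word and word not in known_words:
            count += 1
    return count

def exceeds_gibberish_limit(message, limit=2):
    """Return True if the number of unknown words goes over the limit."""
    return count_unknown_words(message) > limit

print(exceeds_gibberish_limit("Hello, how are you today?"))
print(exceeds_gibberish_limit("asdf qwer zxcv hjkl"))
```

A fixed limit is harsh on long posts, so you might scale it with the number of words in the message instead.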
@RYANTADIPARTHI I don't know of a module. A word checker API would be easier, you just send a request to it and wait for a response. Depending on the API you would check if the "word" the user put in is a real one or not.
However, you might want to try https://pypi.org/project/gibberish-detector/
I don't know if you can install it on repl.it though; you probably can...