old profile: /u/antonim@lemmy.world
AI LLMs simply are better at surfacing it
Ok, but how exactly? Is there some magical emergent property of LLMs that guides them to filter out the garbage from the quality content?
If you don’t feel like discussing this and won’t do anything more than deliberately miss the point, you don’t have to reply to me at all.
they’re a great use in surfacing information that is discussed and available, but might be buried with no SEO behind it to surface it
This is what I’ve seen many people claim. But it is a weak compliment for AI, and more of a criticism of the current web search engines. Why is that information unavailable to search engines, but is available to LLMs? If someone has put in the work to find and feed the quality content to LLMs, why couldn’t that same effort have been invested in Google Search?
Here in my southeast European shithole I’m not worrying about my tax money, the upgrade is going to be pretty cheap, they’re just going to switch from unlicensed XP to unlicensed Win7.
Yep, but I didn’t mention that because it’s not a part of the “Wayback Machine”, it’s just the general “Internet Archive” business of archiving media, which is for now still completely unavailable. (I’ve uploaded dozens of public-domain books there myself, and I’m really missing it…)
You can (well, could) put in any live URL there and IA would take a snapshot of the current page on your request. They also actively crawl the web and take new snapshots on their own. All of that counts as ‘writing’ to the database.
I don’t get the impression you’ve ever made any substantial contributions to Wikipedia, and thus have misguided ideas about what would be actually helpful to the editors and conducive to producing better articles. Your proposal about translations is especially telling, because the machine-assisted translations (i.e. with built-in tools) have already existed on WP long before the recent explosion of LLMs.
In short, your proposals either: 1. already exist, 2. would still risk distortion, oversimplification, made-up bullshit and feedback loops, 3. are likely very complex and expensive to build, or 4. are straight up impossible.
Good WP articles are written by people who have actually read some scholarly articles on the subject, including those that aren’t easily available online (so LLMs are massively stunted by default). Having an LLM re-write a “poorly worded” article would at best be like polishing a turd (poorly worded articles are usually written by people who don’t know much about the subject in the first place, so there’s not much material for the LLM to actually improve), and more likely it would introduce a ton of biases on its own (as well as the usual asinine writing style).
Thankfully, as far as I’ve seen the WP community is generally skeptical of AI tools, so I don’t expect such nonsense to have much of an influence on the site.
As far as Wikipedia is concerned, there is pretty much no way to use LLMs correctly, because probably each major model includes Wikipedia in its training dataset, and using WP to improve WP is… not a good idea. It probably doesn’t require an essay to explain why it’s bad to create and mechanise a loop of bias in an encyclopedia.
It has custom user-made themes that are dark mode, so it probably has dozens of dark modes.
Yeah I’m wondering as well. It seems to save webpages, whereas the issue is with scanned books which may be removed from IA…
So child porn is okay then? You would already have it on your system
You’d have to look for it, knowing full well that it is illegal to produce in the first place and distribute to others, access it online, and then deliberately retain it. It’s not really the same as something that’s legal to produce and distribute (it is certainly legal for me to view your site). You wouldn’t “already” have it.
I doubt you are either.
Well I’ve read some copyright laws, had to solve some issues regarding usage of copyrighted works, etc. Nothing that makes me an expert, but I’m not talking wholly out of my ass either.
It does… on paper… A lot. https://time.com/6266147/internet-archive-copyright-infringement-books-lawsuit/ To the point it’s losing lawsuits over exactly that.
That’s not Wayback Machine per se, that’s Internet Archive’s book scanning and “digital lending” system, which was most definitely doing legally questionable (and stupid) things even to an amateur eye. However, Wayback Machine making read-only copies of websites has so far never been successfully disputed.
What do you mean by “saving a copy”? I still have the .doc file somewhere in my emails. If I told you I’m a serious published writer, and then you asked me where you can read my texts, and I sent you a .doc that hasn’t been proofread, would you take me seriously?
You don’t have any rights to do anything else with it.
That’s patently false. At a minimum, I can quote parts of your content, just as you can quote smaller portions of any published text anywhere, you don’t have to ask the publisher or author for permission. It’s also ridiculous and impossible to control, the content is on my private machine already, how can any law be relevant or exerted upon what I do there? I doubt you’re writing this comment on the basis of your knowledge of copyright law.
Incorrect. Your browser made it do that. How that data is accessed and displayed is not controlled by me.
You’re arguing semantics that really don’t make any difference. The display is irrelevant, because the data by itself is stored on my computer before it is displayed. That data is what you’ve put up online to be accessed.
Owning the CD grants you a license to the content on that CD. That’s about as good as ownership gets there. They own the CD/license. As long as that CD exists/works. You don’t gain that same right by simply visiting a website.
I fail to see the difference between getting a CD with some data (buying it or being given for free, as e.g. a gift) and being sent some data online for free. More importantly - says who? Does copyright law say this about websites?
If an artist makes a painting… and posts a picture of it. They have no rights to the painting anymore? They deserve no ownership/pay for what they’ve done?
This simply doesn’t follow from what I’ve written. They certainly retain the rights to the painting. Besides, “deserving pay” depends on completely different factors than the ones we’re discussing; usually artists sell the actual object, the painting. A digital reproduction is, as far as most people care (I think), merely an informative reproduction, and not the real thing. Stuff that’s posted online for free is… free. It wasn’t intended to make money directly.
Your final paragraph is really confusing me, you seem to be saying that Wayback Machine is also committing theft, which I’m pretty sure is not true (I’ve followed the lawsuits against IA for a while and don’t remember anyone invoking that term). And at this point I don’t know what “theft” is even supposed to mean to you or to anyone else, and what was the point of the discussion anyway. Maybe I should reread the whole discussion carefully all over again, but I’m on my phone and it’s all giving me a headache.
, it’s a salty article
Actually the author himself is somewhat harmed by this situation. I would be salty too. When I wish to write my CV, I can say: my texts have been published at X and Y. Especially nice if it’s an important and well known publication. Now a part of his CV is literally erased, he can’t access his own texts anymore (not even on Internet Archive). That’s… utterly ridiculous. It’s a common practice to send the author a copy (or multiple) of the text he has published, he has every right to own a copy of them. Now the copy that was intended to be available to everyone is not available even to him. Something of the sort really has happened to me too, when a website I published an article on underwent a redesign and now the text just isn’t available anymore. Admittedly it’s still on IA, but it’s an awkward situation.
For what it’s worth, I agree with the other commenter and, as much as I dislike AI as it currently is, I have never and probably never will bitch about the scraping. If I put things out there online, I am aware that they may be used in ways that I never intended. That’s how it has always been, after all.
You’ve put it out there for free, though, and the data literally ends up on my machine because you made it do that, so what’s the problem with me saving the data on my machine for later, and potentially sharing it elsewhere for free again?
then publishing it as your own is theft
This scenario (misattribution of content) has nothing to do with the previous discussion. The other commenter is making an analogy to CDs: owning a CD and lending it to others doesn’t mean you’re claiming its content is your own creation.
Theft implies deprivation of ownership. Calling this theft is like calling piracy theft. It may be illegal by this or that metric, but it’s not normal theft.
Absolutely the correct stance, nothing dirty about it. At this point, for better and for worse, the Internet is a basic necessity. Imagine having your water turned off because you threw water balloons at your neighbour.
And that’s more or less what I was aiming for, so we’re back at square one. What you wrote is in line with my first comment:
The point is that there isn’t something that makes AI inherently superior to ordinary search engines. (Personally I haven’t found AI to be superior at all, but that’s a different topic.) The difference in quality is mainly a consequence of some corporate fuckery to wring out more money from the investors and/or advertisers and/or users at the given moment. AI is good (according to you) just because search engines suck.