Economists and data scientists have been using newspapers to train machine learning models for decades, so it comes as no surprise that AI systems may use the same data for training.


By their nature, news organisations report on worldwide and regional events, and once these stories are published, they are usually republished on feeder publication sites or social channels. The news is collated into multiple formats and becomes public knowledge, and the information within each piece, whether “text as data” or “pictures as data”, is open for use. If AI systems use news datasets (which are available to purchase) to train the generative AI tools on the market, there’s not a lot a media outlet can do to prove categorically that the information came from its own publication.


How is AI being used in academia?

Because OpenAI outputs are generated from multiple data sources, the information, though drawn from static data, is considered dynamic. The outputs from systems such as ChatGPT are citation-less, which creates a major headache for the validity and accuracy of information. This has caused quite a debate over the use of ChatGPT within academic circles, as on multiple occasions the tool has generated false information. Educational institutions are still working out if and how such a tool can be used by students in academic writing, but the general consensus seems to accept its use for research and not as a cited source, as the output is not considered factual.


This creates major dilemmas, as a large number of SEOs, marketers and developers currently use ChatGPT for content creation and HTML coding. If this level of usage continues and increases (as expected), more “SEO tools” will be built on AI outputs, potentially leading to a situation where website content is based on ChatGPT outputs. That could mean a return to the old days of Google, when low-quality, scraped content ranked.


With citation and transparency being the main issue, how do other AI models counteract this?

The AI-powered Bing search engine, which uses a next-generation OpenAI language model plus Microsoft’s Prometheus model, handles attribution well. From a search query, the engine provides a number of options based on the results, and the “result box” contains paragraph copy with a citation system (think Wikipedia) that highlights part of the answer paragraph and cites the source the information came from. However, this is far from ideal, as the citation is mainly a site title link rather than a proper reference. Some other AI systems do this better, but their use is relatively low.


Citation and source data should be the gold standard of AI attribution, as it’s fair, open to scrutiny, and transparent not just for users but also for marketers and content creators. But as yet, we are far from this.


Can OpenAI systems achieve credibility by providing references and citations for news data in machine learning?

The only concern around using news data for AI machine learning is that newspaper content, although monitored by IPSO, still contains a lot of non-factual opinion pieces that lack citations or valid sources (as described by Gentzkow and Shapiro, journalists and media outlets can show confirmation bias based on their political leanings).


For OpenAI systems to achieve a higher level of credibility, the tools must provide references and citations like open-source platforms such as Wikipedia. To have any content published on Wikipedia, you must go through a strict approval process, which is why over the years the platform has become the “gold standard” for data sources. OpenAI uses Wikidata within its machine learning datasets, so there is no reason why it can’t separate factual data from opinion-based content and cite it. At the moment, the output information is usually collated, which is where the transparency issue comes in.


A user would get more value from an OpenAI system that provides a range of outputs for a query, each with its citation source available. But this would more likely mean developing an AI search engine rather than a “quick question prompt and answer” tool. Studies have suggested it takes a user between 5 and 10 prompts on the same question to find the correct answer.


Therefore, if news data is to be fed into OpenAI tools, it should follow the Wikipedia submission process and focus on Eligibility & Notability: “verifiable evidence that the organization or product has attracted the notice of reliable sources unrelated to the organization or product.”


Utilising the above criteria would ensure that content from news and media publications can be used within the dataset for the LLM/ML algorithm, meaning that news data could be attributed correctly with a full citation in a recognised format (APA, MLA and Chicago style are three of the most commonly used). This would help to stop false news and topic misinformation.
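As a rough illustration of what attribution in a set format could look like in practice, the Python sketch below keeps source details alongside a news item and renders them as an approximate APA-style reference. The `NewsSource` fields, the `format_apa` helper and the example article are all hypothetical and are not taken from any real dataset or AI system. Plain text cannot show the italics APA uses for the publication name.

```python
from dataclasses import dataclass
from datetime import date

# English month names written out so the output does not depend on locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


@dataclass(frozen=True)
class NewsSource:
    """Hypothetical provenance record stored next to a news article."""
    author: str      # already in "Surname, Initial." form
    published: date  # publication date of the article
    title: str       # article headline
    outlet: str      # name of the newspaper or news site
    url: str         # where the article can be checked


def format_apa(source: NewsSource) -> str:
    """Build an approximate APA-style reference as plain text."""
    d = source.published
    stamp = f"{d.year}, {MONTHS[d.month - 1]} {d.day}"
    return f"{source.author} ({stamp}). {source.title}. {source.outlet}. {source.url}"


# Made-up article, used only to show the shape of the output.
example = NewsSource(
    author="Smith, J.",
    published=date(2023, 6, 5),
    title="Example headline",
    outlet="Example Gazette",
    url="https://example.com/news/example-headline",
)
print(format_apa(example))
```

If a record like this travelled with each article through the dataset, an AI tool would at least have the details it needs to show where a statement came from. MLA or Chicago output would only need a different formatting function over the same fields.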


The integration of news data for machine learning in AI and media presents opportunities and challenges. However, the lack of citations and references in AI-generated content raises concerns about credibility. Academic debates surround the use of AI-generated content for citation, while SEOs and marketers face dilemmas regarding content quality. Attribution attempts by AI-powered search engines are limited, hindering the establishment of a gold standard for AI attribution. To enhance credibility, OpenAI systems should incorporate verifiable sources and adhere to citation formats. By striving for comprehensive citations, AI systems can improve transparency and reliability, positively shaping the future of AI and media.


