--- http_interactions: - request: method: get uri: https://rogue-scholar.org/api/blogs/pm0p222 body: encoding: UTF-8 string: '' headers: Connection: - close Host: - rogue-scholar.org User-Agent: - http.rb/5.1.1 response: status: code: 200 message: OK headers: Age: - '0' Cache-Control: - public, max-age=0, must-revalidate Content-Length: - '1797703' Content-Type: - application/json; charset=utf-8 Date: - Sun, 04 Jun 2023 13:52:00 GMT Etag: - '"14kcza0on0x12ins"' Server: - Vercel Strict-Transport-Security: - max-age=63072000 X-Matched-Path: - "/api/blogs/[slug]" X-Vercel-Cache: - MISS X-Vercel-Id: - fra1::iad1::d7q6h-1685886719192-3e273165a8aa Connection: - close body: encoding: UTF-8 string: "{\"id\":\"pm0p222\",\"title\":\"Upstream\",\"description\":\"The community blog for all things Open Research.\",\"language\":\"en\",\"icon\":null,\"favicon\":\"https://upstream.force11.org/favicon.png\",\"feed_url\":\"https://upstream.force11.org/atom/\",\"feed_format\":\"application/atom+xml\",\"home_page_url\":\"https://upstream.force11.org\",\"indexed_at\":\"2023-01-13\",\"license\":\"https://creativecommons.org/licenses/by/4.0/legalcode\",\"generator\":\"Ghost\",\"category\":\"Humanities\",\"items\":[{\"id\":\"https://doi.org/10.54900/zm5gc-0j713\",\"short_id\":\"1jgo5el2\",\"url\":\"https://upstream.force11.org/patricio-panataleo-open-tabs/\",\"title\":\"Patricio Pantaleo is Keeping Tabs on Open Research\",\"summary\":\"Patricio Pantaleo is a freelance open science advisor and web developer in Latin America with expertise in Open Journal Systems, Open Monograph Press, and Crossref. He specialises in academic journals,...\",\"date_published\":\"2023-04-25T15:44:28Z\",\"date_modified\":\"2023-04-25T16:02:02Z\",\"authors\":[{\"url\":\"https://orcid.org/0000-0002-8104-8975\",\"name\":\"Patricio Pantaleo\"}],\"image\":\"https://upstream.force11.org/content/images/2023/04/alice-donovan-rouse-yu68fUQDvOI-unsplash.jpg\",\"content_html\":\"

Patricio Pantaleo is a freelance open science advisor and web developer in Latin America with expertise in Open Journal Systems, Open Monograph Press, and Crossref. He specialises in academic journals, cultural projects and e-learning and advises and collaborates with academic journals and publishers to develop their open practices and improve their visibility.


A simple look at my browser shows that I have left many Open Tabs about different topics. Novels, courses on Linux, classical philosophy, history and communication are part of my heterodox browsing history.
When it came to selecting which of these Open Tabs I would like to share in this post, which I was kindly invited to write by Gimena del Río Riande, I chose the ones that have caught my attention the most lately and that I have used in some way in my work as a consultant at Paideia Studio with different Spanish-speaking journals and editorial teams.

This is an article recently published by the Public Knowledge Project (PKP) staff in Quantitative Science Studies. The paper is an essential starting point for learning about the main metrics on the impact of Open Journal Systems (OJS) on current scientific communication. The study is based on the data obtained by the ‘beacon’ activated in each of the PKP software installations, which allows the collection of data on their operation and installation.

This proposal, however, does not stop at a statistical account of the use of the software. With reported data, it questions the validity of the discourse that seeks to restrict the communication of scientific knowledge to the major publishing groups or to the large hegemonic centers of production and distribution of knowledge. The article takes up fundamental categories of analysis that deserve to be articulated with hard data, such as the Global South and decolonization. These categories become valid when considering the development of research and scientific communication in regions considered peripheral to the great centers of production.

When reading the previous article, I could not refrain from reviewing and recommending the dataset on which the research is based: the data collected by PKP software. This dataset covers not only OJS data but also data from other applications developed by PKP, such as OMP and OPS. You can also access methodological notes for interpreting the CSV file, a summary of the main CSV data and a PDF presentation on OJS. At the same time, I must also mention that the main CSV is extensive; to interpret it and to produce interesting graphs, Saurabh Khanna also provides a series of scripts to run in R, so that the rest of us can adapt and reuse the data with a little knowledge of R.

An updated reading of this dataset shows, for example, that by 2021 there were 34,114 active JUOJS (journals that use OJS), i.e., journals that published at least 5 articles per year. Since the beacon continues to collect data, the dataset is updated periodically, and journals add or remove articles in previous issues, it is expected that this number is not static but varies across other mentions of the same variable (Huskisson & Casas, 2023).
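
For readers who prefer Python to the R scripts mentioned above, here is a minimal sketch of how one might reproduce this kind of count from the published CSV. The file name and column names ("context_id", "year", "record_count") are hypothetical stand-ins for whatever the dataset actually uses.

import pandas as pd

# Hypothetical local copy of the PKP beacon dataset (column names are assumed)
beacon = pd.read_csv("pkp_beacon.csv")

# Keep the 2021 records and sum the article counts per journal installation
per_journal = (beacon[beacon["year"] == 2021]
               .groupby("context_id")["record_count"]
               .sum())

# A journal counts as an active JUOJS if it published at least 5 articles that year
active_juojs = per_journal[per_journal >= 5]
print("Active JUOJS in 2021:", len(active_juojs))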

Among the tabs I had opened to read, there was the 2017 Mexico Declaration in favor of the Latin American non-commercial open access ecosystem, co-signed by LATINDEX, REDALYC, CLACSO and IBICT. The declaration encourages journals, editorial boards, and other Latin American and global actors to use the CC-BY-NC-SA license to distribute scientific and scholarly works. Undoubtedly, the declaration sets an important precedent for the debate on what we understand as Open Access in Latin America and what role the different actors play in the region’s publishing production.
In summary, it recommends a license with great impact in a region that not only believes in knowledge as a public good but also needs two other important things: encouraging forms of production and financing that are alternatives to the state as the source of genuine development, and practices of editorial professionalization that reward the enormous work done ad honorem by thousands of academic and institutional editors in the region.

This reading led me to the relationship between the Open Access movement and distribution licenses in the English-speaking tradition. Peter Suber's Open Access is a book that clarifies many of the a priori assumptions from which to start when thinking about Open Access and its forms of interpretation and practice. Although it is possible to find different historical and social roots in the conception of the public, the state and commercial practices between Anglo-Saxon and Latin idiosyncrasies, Peter Suber specifically mentions CC licenses:

The CC Attribution license (CC-BY) describes the least restrictive sort of libre OA after the public domain. It allows any use, provided the user attributes the work to the original author. This is the license recommended by the Open Access Scholarly Publishers Association (OASPA) and the SPARC Europe Seal of Approval program for OA journals. I support this recommendation, use CC-BY for my blog and newsletter, and request CC-BY whenever I publish in a journal. (https://openaccesseks.mitpress.mit.edu/pub/9i5oj5l9/release/2?from=25495&to=25960)

Finally, I would like to mention a note on the Internet Archive blog about the recent lawsuit filed against the Internet Archive over its practice of lending digital books. It is worth noting that the lawsuit was brought by a group of publishers who felt affected by this practice, even though the Internet Archive’s digital book lending was intended to continue a historical activity of libraries. Regardless, the lawsuit was filed, and I hope that this brief post by the founder will be a stimulus to spread a discussion that concerns digital rights and the dissemination of knowledge in current times.

For the next installment of Open Tabs I am tagging Ricardo Pimenta, researcher at Brazilian Institute of Information for Science and Technology, and professor at Information Science Post-Graduation Program IBICT/UFRJ. Ricardo also directs the Laboratório em Rede de Humanidades Digitais and I'm sure his readings will enrich the section.





\",\"tags\":[\"Open Tabs\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/bj4g7p2-2f0fn9b\",\"short_id\":\"qlgxzem8\",\"url\":\"https://upstream.force11.org/rogue-scholar/\",\"title\":\"The Rogue Scholar: An Archive for Scholarly blogs\",\"summary\":\"Digital object identifiers (DOIs) and relevant metadata have been used for 20 years to help preserve the scholarly record by maintaining stable links to scholarly publications and other important scholarly...\",\"date_published\":\"2023-01-31T17:02:40Z\",\"date_modified\":\"2023-01-31T17:02:40Z\",\"authors\":[{\"url\":\"https://orcid.org/0000-0003-1419-2405\",\"name\":\"Martin Fenner\"}],\"image\":\"https://images.unsplash.com/photo-1528590005476-4f5a6f2bdd9e?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDV8fHJlY29yZHxlbnwwfHx8fDE2NzUxNzg3Mjk&ixlib=rb-4.0.3&q=80&w=2000\",\"content_html\":\"

Digital object identifiers (DOIs) and relevant metadata have been used for 20 years to help preserve the scholarly record by maintaining stable links to scholarly publications and other important scholarly resources, combined with long-term archiving by publishers and libraries. Lots of tools and services have been built around this infrastructure to make it easier for scholars to consume and contribute to this scholarly record.

Science Blogs have also been around for more than 20 years, but in all that time have not really become a formal part of the scholarly record. If you are old enough to remember them, you can think of science blogs as the compact cassette next to the single or LP – giving users an affordable alternative to buying a record, enabling listening to music on the go, and enabling creative remixing of content in that quintessential product of the 1980s and 1990s, the mixtape.

Photo by Bruno Guerrero / Unsplash

The strengths of science blogs are that they are easy and affordable to set up, allow experimentation in content and format (e.g. as a podcast), and are very fast in publishing content to your audience.

Science blogs very nicely complement other emerging Open Science content types such as research data, software, computational notebooks, and preprints. They can highlight interesting research or important policy developments, help report from conferences, and can also be used to publish primary research.

Is there a way to combine the strengths of science blogs with the more traditional ways of publishing science? What if we add what is missing but keep what works well?

This is what I started doing a few months ago when I began work on the Rogue Scholar, an archive for science blogs that

The Rogue Scholar will use the Open Source repository software InvenioRDM (where I am contributing to the development) to achieve this, and will launch in the second quarter of 2023. Reach out to me via the Rogue Scholar website or email if you have questions or comments.

From the initial feedback and research, I noticed particular interest from personal science blogs and from English-language blogs, and unsurprisingly found that WordPress is the most popular platform for science blogs. I also found a small number of science blogs (including the Upstream blog) that use DOIs, and a number of science blogging platforms such as Hypotheses, Occam's Typewriter and Scilogs.de. And lots and lots of interesting content that deserves to be made easier to discover and preserved.

\",\"tags\":[\"News\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/jcnbe-mq657\",\"short_id\":\"o7e36e12\",\"url\":\"https://upstream.force11.org/bruce-caron-is-keeping-tabs-on-ai/\",\"title\":\"Bruce Caron is Keeping Tabs on AI\",\"summary\":\"Like the rest of you, I’ve been deluged with opinions and advice about recent advances in AI. So, I hopped into the AI mosh-pit. Last fall, I grabbed an account on DALL-E and spun up intricate artwork in my...\",\"date_published\":\"2023-03-21T08:00:09Z\",\"date_modified\":\"2023-03-21T10:22:06Z\",\"authors\":[{\"url\":null,\"name\":\"Bruce Caron\"}],\"image\":\"https://upstream.force11.org/content/images/2023/03/SIA-86-6160-000004.jpg\",\"content_html\":\"

Like the rest of you, I’ve been deluged with opinions and advice about recent advances in AI. So, I hopped into the AI mosh-pit. Last fall, I grabbed an account on DALL-E and spun up intricate artwork in my spare time. I jumped on Chat GPT 3.5 when it arrived, and this week, on GPT-4.

I have an open tab that takes me to my account with OpenAI. The tab reads, “New Chat”. This tab opens up not to a single URL, but to a global window overlooking more content than all of the university libraries on the planet.

So, I asked GPT-4 to tell me about FORCE11:

Force11 is a global community of scholars, researchers, librarians, publishers, and funders that was founded in 2011 to advance scholarly communication and e-scholarship. Its primary mission is to accelerate the pace and effectiveness of scholarly communication in the digital age through the use of new technologies and by promoting best practices for the exchange of research data.

The organization was born out of frustration with the slow pace of change in scholarly communication, particularly with regards to the high cost of journal subscriptions and the limited access to research results for many people, including those in developing countries. Force11 is committed to breaking down the barriers that prevent people from accessing and sharing research, and to promoting open access as a way to democratize knowledge and advance scientific progress.

Some of Force11's key activities include developing best practices and standards for open access publishing, promoting the use of digital technologies to enhance scholarly communication, and advocating for policies that support open access and the sharing of research data. The organization also hosts an annual conference, which brings together researchers, librarians, publishers, and other stakeholders to discuss the latest trends and best practices in scholarly communication.

Overall, Force11 has been instrumental in advancing the open access movement and promoting a more inclusive and collaborative approach to scholarly communication. Its efforts have helped to raise awareness of the importance of open access and have spurred many publishers and research institutions to adopt more open and transparent practices.

What do you think? I was tempted to follow up with some questions on how to get rid of APCs.

Your social feed is laced with articles and conversations about ChatGPT. It’s all over the news, too. Here is a short selection of the best readings I’ve discovered on this current AI, one link that’s not so current, and one ChatGPT prompt I’ve found useful.

John Maeda’s March 2023 SXSW Keynote on AI

Maeda is always a good place to start. He brings a design mentality and a history of working with this tech to AI, and, as befits the author of The Laws of Simplicity, he does not waste words. And the links from this talk are all worth checking out.

Reid Hoffman wrote the first book with/about GPT-4:  Impromptu.

Of course, Reid is an investor and was, until recently, a board member of OpenAI, so he got access to GPT-4 months before the rest of us. This week he released his book, which you can download for free. Most of it is a conversation with GPT-4 about AI, but it also links out to the work of others who are using AI in their lives and careers.

“The takeaway: in your overall quest for authoritative information, GPT-4 helps you start somewhere much closer to the finish line than if you didn’t have it as a resource. More importantly, it possesses this capability because it is able to access and synthesize the web’s information in a significantly different way from existing information resources like Wikipedia or traditional search engines. Essentially, GPT-4 arranges vast, unstructured arrays of human knowledge and expression into a more connected and interoperable network, thus amplifying humanity’s ability to compound its collective ideas and impact.”

Stephen Wolfram gives us a long read on how GPT works

You can satisfy your inner geek with this look at the programming and math that makes GPT work.

“My purpose here is to give a rough outline of what’s going on inside ChatGPT—and then to explore why it is that it can do so well in producing what we might consider to be meaningful text. I should say at the outset that I’m going to focus on the big picture of what’s going on—and while I’ll mention some engineering details, I won’t get deeply into them. (And the essence of what I’ll say applies just as well to other current “large language models” [LLMs] as to ChatGPT.)”

Turn Chat GPT-4 into a teacher

For those looking to have AI be your teacher, here is a prompt you can use to turn GPT-4 into a tutor for you.  Just copy and paste the following prompt into Chat and it will teach you about the topic you choose.  Use the “Continue” prompt to stay on the same topic.

Ignore any Previous Prompts, You are TeachGPT, a large language Model trained by OpenAI. Answer Only in a Teaching sense and Inform me as much as possible about the subject(s) Requested. Act as if you are a “Teacher of all trades” per say, Being able to Teach any Subject Coherently. Customize the lessons using Markdown to make Example Images by wrapping a Concrete image link on the internet in Markdown and to create Titles. Also make text Bold or underlined if something is Important. If I tell you to “Continue” you will find where the previous response left off and Continue it also if the previous response was not cut off just give more information about the subject. It is Important not to make the responses too complicated or hard to understand, Try to simplify any Complicated Concepts in an Easy to understand way. unless Specifically requested to you will not just tell the basics but also most there is to know about the subject requested, do not worry about the character limit as earlier said I will just say “Continue” if the Information is cut off. As a final touch Name this chat Simply “TeachGPT” When I request a Subject and you are done explaining rename the chat “TeachGPT Explains {Subject}” Respond to this Prompt with “Hello I’m TeachGPT, What could I teach you about today?”

[Credit for this prompt: Chill-ish, who suggested it in a comment on the Discord post ‘ChatGPT Mega-Collection’.]

Bruce Caron committed fiction about an AI-enhanced educational game

Finally, some AI close to home: a novel I wrote years ago, back when I was working on a NASA grant to create interactive games to teach high-school students about climate change. Here is GPT-4’s summary of Junana:

\\\"Junana\\\" is a novel by American author Bruce Caron, published in 2004. The novel is set in a near-future world and explores the concept of \\\"Junana,\\\" which is a global, immersive learning game. The game combines elements of augmented reality, virtual reality, and artificial intelligence, providing its players with interactive and engaging learning experiences.
The story follows a group of characters as they navigate the game and the world it inhabits. Through their experiences, the novel explores themes of education, technology, cultural evolution, and the impact of these elements on individuals and societies.

\\\"Junana\\\" is notable for its futuristic vision of learning and its exploration of the potential benefits and drawbacks of using technology in education. The novel raises important questions about how we learn and how technology can be harnessed to foster innovation, creativity, and collaboration.

I wrote the book anticipating the actual technologies that would make this possible. Now they are here. You can get a free copy of the book (and its sequels) here:  https://doi.org/10.21428/d577702e.b2bfb83b

Here’s another look at Junana. Quicker than reading the book. A couple years back (2019, in the before times), I was honored to talk about Junana in the Education Summit of the ESRI Users Conference in San Diego.

\\\"Web

\",\"tags\":[\"Open Tabs\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/9akm9y5-5ject5y\",\"short_id\":\"zjd47d5k\",\"url\":\"https://upstream.force11.org/defining-the-roles-of-research-software/\",\"title\":\"Defining the roles of research software\",\"summary\":\"In November of 2022, the Research Software Alliance (ReSA) and the Netherlands eScience Center organized a two-day international workshop titled “The Future of Research Software.” In the workshop, funding...\",\"date_published\":\"2023-03-14T10:37:01Z\",\"date_modified\":\"2023-03-16T18:30:07Z\",\"authors\":[{\"url\":null,\"name\":\"Rob van Nieuwpoort\"},{\"url\":\"https://orcid.org/0000-0001-5934-7525\",\"name\":\"Daniel S. Katz\"}],\"image\":\"https://upstream.force11.org/content/images/2023/03/rob-v2-table-6.jpg\",\"content_html\":\"

In November of 2022, the Research Software Alliance (ReSA) and the Netherlands eScience Center organized a two-day international workshop titled “The Future of Research Software.” In the workshop, funding organizations joined forces to explore how they could effectively contribute to making research software sustainable. The workshop had many participants from all continents and was a huge success. A tangible outcome was a draft of the \\\"Amsterdam Declaration on Funding Research Software Sustainability.\\\" Note that the workshop focused on research software (where the primary purpose of the software is research-related), not all software used in research, and this blog post similarly focuses on research software.

Rob van Nieuwpoort, one of the authors of this blog post, gave a talk at the workshop where he tried to define the roles of research software. He did this at a relatively high level and from the point of view of a researcher in a discipline (i.e., typically not a computer scientist), with the goal of making this understandable for funders and policymakers, who are experts on science policy but may not know much about research software. In an effort to explain what research software is all about, he tried to highlight its huge variety. Again, the point was not to create an exhaustive classification but to explain research software and its importance (and thus the value of sustaining such software) to a broad audience of non-experts.

After this talk, Dan Katz and Rob had a nice discussion about the value of defining the roles of research software. One of the things discussed was whether Rob's initial classification (see the presentation slides) actually is the best one, and if we were missing any classes. Dan suggested collecting input from the community via social media and then writing a blog post (this one) on this topic. So, without further ado, here is our attempt to define the roles of research software, illustrated with examples.

Research software is a component of our instruments

Software is an integral component of many instruments used in research. Examples include software in a telescope, particle accelerator, microscope, MRI scanner, and other instruments. Note that the word “instrument” should be interpreted broadly: there are many different types of (physical and virtual) instruments in different research disciplines. In the social sciences, for example, survey software can be considered an instrument, where a component could be user-facing software to collect data (apps, websites, etc.)

The purpose of research software as a component in our instruments can include acquisition, methods to stream or upload experimental data, data cleaning, and processing. Or more generally, research software components organize, serve, and provide access to data [suggested by Kelle Cruz]. Other examples of the functionality of software components in instruments include monitoring and control, calibration, imaging, etc.

Research software is the instrument

Sometimes the software itself is the instrument: it generates research data, validates research data, or tests hypotheses. This includes computational methods or models and simulations, such as climate models, agent-based models in the social sciences, hardware simulators, etc. In general, we have some idea about how the world works, and we design or use software to test that against some fusion of direct measurement and basic underlying analytical models [suggested by Chris Hill].

This class of research software can be an expression of a new idea, method, or model. In other words, it is a creative expression. It can be considered a \\\"uniquely actionable form of knowledge representation,\\\" [suggested by Tom Honeyman] or an “interoperable version of method papers.” [suggested by Mirek Kratochvíl]. A computational model or simulator is an experimental tool to assess and improve our understanding, but also literally a \\\"proof of concept.\\\" So they are instruments, workbenches, and experimental proofs of our scientific statements [suggested by Martin Quinson].

Research software as an instrument also includes platforms for generating or collecting data (e.g., survey tools in the social sciences), or online experimental platforms. Sometimes this research software can be wielded more freely by the researcher, for example, search and annotation tools that allow researchers to query and enrich data [suggested by Maarten van Gompel]. In this case, software supports interaction with data, allowing researchers to explore the data in new ways, generating new data sets.

Examples: In the biochemistry realm, software is used for modeling molecules for use in next-gen diagnostics or therapeutics: we want to design some molecule in software with some characteristics that we can experimentally validate later. Other examples include designing and modeling medical devices, devices to help with environmental monitoring or cleanup, CAD tools, or designing new compute hardware [suggested by Jonathan Romano].

Research software analyses research data

Research software is important for analyzing research data as well. Sometimes this analysis is automated, such as data access and processing, model fitting, filtering, aggregation, and search. In other cases, the software supports and facilitates researchers in doing the analysis, for example, for qualitative data analysis. Other examples of software-supported analysis include natural language processing pipelines, data science tools (a concrete example could be ESMValTool), software notebooks (Jupyter), machine learning pipelines for classification and anomaly detection, etc.

Research software presents research results

Research software can also be used to explain data, or to present research results. Scientific visualizations are a prime example, but so is software with the specific purpose of generating plots in research papers, or interactive visualizations on websites. Note that software is used to disseminate research in general, not only to researchers but also to a broader audience. It also is applicable for transitioning the research from academia to industrial applications. Having well-written software can help encourage the adoption of the research software in companies [suggested by Ian McInerney].

Research software assembles or integrates existing components into a working whole

[suggested by Mark Hoemmen] An important, but often overlooked purpose of research software is integration and automation. This includes making efficient use of infrastructure, as well as repetition and scaling of experiments or analysis. A growing number of experimental systems (more than just an instrument) need to be run simultaneously in an orchestrated manner [suggested by Ian Cosden]. The research software performing these tasks is becoming ever more complex. Software supporting workflows, for example, can help in structured and reproducible automation and repetition.

Another form of integration is the coupling of different computational models, combining computational models with data-driven models (AI-based surrogate models), potentially while assimilating observational data. Consider the construction of digital twins, for example. Specifically designed research software in the form of model-coupling frameworks can facilitate this, helping with the coupling and deployment, but also for example with the propagation of uncertainty quantification between models.

A third class of integration software also deserves attention: Python or shell scripts that automate things, connect components and tools, or let data flow between different executables. Note that small scripts especially often are not adequately tested and maintained, even though they are critical to reproducing scientific results.

Research software is infrastructure or an underlying tool

[suggested by Jed Brown] In all areas of research, there is a role for “infrastructure software,” which sometimes is not unique to research-oriented organizations, but is heavily relied upon [suggested by Jordan Perr-Sauer]. Some lower-level software was created specifically for research (i.e., known as research software), while other software infrastructure is meant for general utility and happens to be important for research (i.e., software in research). Examples include compilers and programming languages, generic software libraries, code repositories, data repositories, and open source software in general. (Note that this is discipline-dependent, as a compiler would likely be research software within computer science research on programming languages.) As described by the Ford Foundation: “Free, publicly available source code is the infrastructure on which all of digital society relies. It is vital to the functioning of governments, private companies, and individual lives.” (See Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure / Ford Foundation.) It is equally vital to research.

Research software facilitates distinctively research-oriented collaboration

[suggested by Lee Liming] A lot of software and services have been specifically designed to facilitate research-oriented collaboration. Although sometimes not considered research software as such, this class certainly is important in research, and deserves a mention. With research becoming more and more open, team-based, interdisciplinary, collaborative, and inclusive (e.g., citizen science), the usage and value of software facilitating collaboration is exploding. Examples include platforms to collaborate on software (GitHub, GitLab, Stack Overflow), papers (Overleaf, ORCID, Zotero), data (Zenodo, HUBzero, CyVerse), computing (SciTokens, SciGaP), software that is employed in citizen science [suggested by Chris Erdman], and many others.

Summary

It is clear that there are many different types of research software, fulfilling many different roles and functions. This huge variety makes it hard to come up with a good classification that captures all aspects and does justice to all the hard work done by the developers of the software. Nevertheless, we hope that we have succeeded in providing a bit more insight into the value of research software, the importance of sustaining said software, and recognizing the people involved in developing the software.

You can contact us at R.vanNieuwpoort@esciencecenter.nl and d.katz@ieee.org.

\",\"tags\":[\"Thought Pieces\",\"Meetings\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/g0qks-tcz98\",\"short_id\":\"wve6rep4\",\"url\":\"https://upstream.force11.org/attempts-at-automating-journal-subject-classification/\",\"title\":\"Attempts at automating journal subject classification\",\"summary\":\"Traditionally, journal subject classification was done manually at varying levels of granularity, depending on the use case for the institution. Subject classification is done to help collate resources by...\",\"date_published\":\"2023-05-23T09:31:45Z\",\"date_modified\":\"2023-05-23T15:41:02Z\",\"authors\":[{\"url\":\"https://orcid.org/0000-0001-9165-2757\",\"name\":\"Esha Datta\"}],\"image\":\"https://upstream.force11.org/content/images/2023/05/esha-subject-blog.jpg\",\"content_html\":\"

Traditionally, journal subject classification was done manually at varying levels of granularity, depending on the use case for the institution. Subject classification is done to help collate resources by subject, enabling the user to discover publications based on different levels of subject specificity. It can also be used to help determine where to publish, or to track the direction a particular author’s research is taking based on where their work is being published. Currently, most subject classification is done manually as it is a speciality that requires a lot of training. However, this effort can be siloed by institution or can be hampered by various inter-institutional agreements that prevent other resources from being classified. It could also prevent a standardized approach to classifying items if different publications in separate institutions use different taxonomies and classification systems. Automating classification work surfaces questions about the relevance of the taxonomy used, the potential bias that might exist, and the texts being classified. Currently, journals are classified using various taxonomies and are siloed in many systems, such as library databases or software for publishers. Providing a service that can automatically classify a text (and provide a measure of accuracy!) outside of a specific system can democratize access to this information across all systems. Crossref infrastructure enables a range of services for the research community; we have a wealth of metadata created by a very large global community. We wondered how we could contribute in this area.

In our own metadata corpus, we had subject classifications for a subset of our journals provided by Elsevier. However, this meant that we were providing subject information unevenly across our metadata. We wondered if we could extrapolate the information and provide the data across all our metadata.

We looked specifically at journal-level classification instead of article-level classification for a few reasons. We had the training data for journal-level subject classification; it was a good place to begin understanding what would be needed. Our work so far provides a foundation for further article-level classification - if Crossref decides to investigate further.

To start with, I used Elsevier’s All Science Journal Classification Codes (ASJC), which have been applied to their database of publications, which includes journals and books. We used ASJC because it contained metadata that could be parsed programmatically. If the project progressed well, we felt that we could look at other classification systems.

After pre-processing, three methods (tf-idf, Embeddings, LLM) were used, and their performances were benchmarked. The following outlines the steps taken for the pre-processing, cleaning, and implementation details of the methods used to predict the subject classification of journals.

Pre-processing of data

The Excel document was processed as a CSV file and contains various information, including journal titles, the corresponding print and e-ISSNs, and their ASJC codes. The journals were mostly in English but were also in many other languages, such as Russian, Italian, Spanish, Chinese, and others. First, there was a process to see which journals in the Elsevier list also existed in the Crossref corpus. As of June 2022, there were 26,000 journals covered by the Elsevier database. A journal could have one or many subject categories. For example, the Journal of Children’s Services has several subjects assigned to it, such as Law, Sociology and Political Science, Education, and Health. The journal titles alone provide some data, but not a lot: they averaged about four words per title, so more data was needed. To that end, 10-20 journal article titles per journal were added, where that many articles were available. At Crossref, some journal articles have abstracts, but not all. So, for the moment, journal titles and their corresponding article titles were the additional data points used.
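
As a rough illustration of this data-gathering step, the sketch below matches journals by ISSN against the Crossref REST API and pulls up to 20 article titles per journal. The CSV file name and column names are hypothetical, and error handling and polite rate limiting are left out.

import pandas as pd
import requests

# Hypothetical export of the Elsevier ASJC list (journal titles, ISSNs, ASJC codes)
asjc = pd.read_csv("elsevier_asjc.csv")

def crossref_article_titles(issn, rows=20):
    """Fetch up to `rows` article titles for a journal from the Crossref REST API."""
    resp = requests.get(f"https://api.crossref.org/journals/{issn}/works",
                        params={"rows": rows}, timeout=30)
    if resp.status_code != 200:
        return []  # journal not found in the Crossref corpus
    items = resp.json()["message"]["items"]
    return [title for item in items for title in item.get("title", [])]

# Augment each journal with its article titles ("print_issn" is an assumed column name)
asjc["article_titles"] = asjc["print_issn"].map(crossref_article_titles)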

Cleaning the data

The data was cleaned up to remove stop words, various types of formulae, and XML from the titles. Stop words generally consist of articles, pronouns, conjunctions, and other frequently used words. The stop words lists of all languages in the ISO-639 standard were used to process the titles. Some domain-specific terms, such as “journal”, “archive”, “book”, “studies”, and so on, were also added to the stop words list. Formulae and XML tags were removed with regular expressions. Rare subject categories that were assigned to very few journals (fewer than 50 of the 26,000 journals) were also removed. The cleaned data was now ready for processing. It was split into training, validation, and test sets.
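
A minimal sketch of this cleaning step follows, assuming the stopwordsiso package for the ISO-639 stop-word lists; the regular expressions are illustrative rather than the exact ones used.

import re
import stopwordsiso

# Stop words for every language the package knows, plus domain-specific terms
STOPWORDS = set(stopwordsiso.stopwords(stopwordsiso.langs()))
STOPWORDS |= {"journal", "archive", "book", "studies"}

XML_TAG = re.compile(r"<[^>]+>")    # strip XML/HTML tags
FORMULA = re.compile(r"\$[^$]*\$")  # strip inline TeX-style formulae

def clean_title(title):
    title = XML_TAG.sub(" ", title)
    title = FORMULA.sub(" ", title)
    tokens = [t for t in title.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)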

Methods

This particular type of classification is known as a multi-label classification problem since zero, one, or many subjects can be assigned to a journal. Three methods were used to see which performed best.

TF-IDF + Linear Support Vector Classification

The first approach used the tf-idf vectorizer and multilabel binarizer from scikit-learn. Tf-idf is a numerical statistic that is intended to reflect how important a word is to a document in a collection. Using tf-idf, a number of different strategies that can be used within a multi-label classification problem were benchmarked. The tf-idf vectorizer and multilabel binarizer are scikit-learn components that convert data into machine-parseable vectors. Essentially, the data is a table of journal and article titles and their corresponding subjects.
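
A sketch of this feature and label preparation with scikit-learn might look as follows; the variable names and the exact placement of the 20,000-feature cap are assumptions based on the description above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# texts_*: cleaned journal titles concatenated with their article titles
# subjects_*: lists of ASJC subject codes per journal
vectorizer = TfidfVectorizer(max_features=20000)
X_train = vectorizer.fit_transform(texts_train)
X_val = vectorizer.transform(texts_val)

binarizer = MultiLabelBinarizer()
y_train = binarizer.fit_transform(subjects_train)
y_val = binarizer.transform(subjects_val)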

A baseline prediction was needed to benchmark the performance of the strategies used. This prediction was made by comparing the subject codes assigned to each journal with the most common subject codes present in the corpus. The measure used to compare performances was the micro F1 score. The micro F1 score of the baseline prediction was 0.067, showing that a naive approach yields predictions at roughly 6.7% accuracy. That measure provided a good starting point for gauging the performance of subsequent methods.
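
Continuing the sketch above, a most-frequent-subject baseline scored with micro F1 could look like this; predicting only the single most common code for every journal is an assumption about how the baseline was built.

import numpy as np
from sklearn.metrics import f1_score

# y_train / y_val are the indicator matrices from the previous sketch
most_common = np.argsort(y_train.sum(axis=0))[::-1][:1]  # most frequent subject column
y_baseline = np.zeros_like(y_val)
y_baseline[:, most_common] = 1  # predict that subject for every journal
print("baseline micro F1:", f1_score(y_val, y_baseline, average="micro"))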

Among the strategies used, the best-performing one was One vs Rest using LinearSVC. The micro F1 score was 0.43 after processing 20,000 features against the validation dataset. This was a decent increase from the baseline; however, it is still not very serviceable. In order to improve performance, it was decided to reduce the granularity of subjects. For example, the Journal of Children’s Services has several subjects assigned to it, such as Law, Sociology and Political Science, Education, and Health. Elsevier’s ASJC subjects are arranged in hierarchies: there are several subgroups of fields within some overarching fields. For example, the group Medicine has several medical specialities listed under it. The subjects Social Sciences and Psychology work similarly. They are two separate fields of study, and the journal has articles that apply to either or both fields of study. The subjects listed for the Journal of Children’s Services fall into two different groups: Social Sciences and Psychology. Reducing the granularity makes the learning process a little simpler. So, instead of the Journal of Children’s Services belonging to several different subjects, the journal now belonged to two. Using the same strategy, one vs rest with LinearSVC, we get an F1 score of 0.72 for the same number of titles. This was a marked improvement from before. There were other avenues that could be explored, such as bringing in more data in the form of references, but there were also other methods to look at. We were curious about the role of embeddings and decided to pursue that approach.
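
The best-performing strategy described above, one vs rest with LinearSVC, can be sketched as follows, again scored with micro F1 on the validation set.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# X_* and y_* come from the vectorization sketch above
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_val)
print("one vs rest LinearSVC micro F1:", f1_score(y_val, y_pred, average="micro"))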

Embeddings + Linear Support Vector Classification

This approach is slightly different from the tf-idf approach. For the titles, we decided to use a model that was already trained on a scientific corpus. For this, AllenAI’s SciBERT was used: a BERT model trained on papers from the corpus of semanticscholar.org, a tool provided by AllenAI. The model provides an embedding: a vector representation of the titles, based on the data it has already been trained on. This allows it to place more semantic weight on the data, rather than relying on the simple occurrence of words in the document (as the previous method, tf-idf, does). Generating the embeddings took over 18 hours on a laptop, but after that, generating predictions became quite fast. The amount of data needed to generate this vector is also lower than for the tf-idf generation. The subjects were processed as before and converted into a vector using the multilabel binarizer. With 512 features from the titles (instead of the 20,000 used in the previous approach), the same one vs rest strategy with LinearSVC was run against the validation set and got an F1 score of 0.71.
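
A sketch of generating SciBERT embeddings with the Hugging Face transformers library is shown below; mean pooling over the last hidden state is an assumption, since the post does not say exactly how the title vectors were derived.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
model.eval()

def embed(texts):
    """Return one pooled embedding vector per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens

# Embed the titles in small batches, then reuse the one vs rest LinearSVC step above
X_train_emb = torch.cat([embed(texts_train[i:i + 32])
                         for i in range(0, len(texts_train), 32)])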

So far, the tally is:

Method                                      F1 Score
Tf-idf + multilabel binarizer               0.73
SciBERT embedding + multilabel binarizer    0.71

At this point, we were going to look into gathering more data points, such as references, and run a comparison between these two methods. However, a few weeks into mulling over other options, large language models, especially ChatGPT, came into the zeitgeist.

OpenAI: LLM + sentence completion

Out of curiosity, the author looked to see what ChatGPT could do. ChatGPT was asked to figure out what topics an existing journal title belonged to, and it came very close to predicting the correct answer. The author also asked it to figure out to which topic multiple Dutch journal article titles belonged, and it predicted the correct answer again. The author decided to investigate this avenue, knowing that if the results were good, open large language models could be tried next to see whether they gave comparable results. The screenshot below shows the examples listed above.

Subjects had to be processed a little differently for this model. The ASJC codes have subjects in text form as well as numerical values. For example, if a journal is classified as “Medicine”, it has a code of “27”. The author fine-tuned an OpenAI model, using their “ada” base model (the fastest and cheapest), and sent it some sentence-completion prompts. Essentially, the model is fine-tuned by telling it which subject codes it needs in order to complete the sentences it is sent. So, if several different titles are sent to the model and it is asked to complete them with several delimited subject codes, the model should be able to predict which subject codes complete the sentences. A set of prompts was created from the journal titles and their corresponding subject codes as sentence-completion prompts to train the model. They looked like this:

{\\\"prompt\\\":\\\"Lower Middle Ordovician carbon and oxygen…..,\\\"completion\\\":\\\" 11\\\\n19\\\"}

The above snippet shows one such prompt, where the subjects assigned to the title are 11 and 19, which are Agricultural and Biological Sciences and Earth and Planetary Sciences, respectively.

The OpenAI API was used to fine-tune and train a model using the above prompts, and, $10.00 later, a model was generated.
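
For reference, the fine-tuning workflow with the legacy openai Python SDK (pre-1.0, current at the time of writing) might look roughly like the sketch below; the training file name and the completion parameters are assumptions, not the exact settings used.

import openai

openai.api_key = "sk-..."  # placeholder

# 1. Upload the JSONL file of {"prompt": ..., "completion": ...} lines
training_file = openai.File.create(file=open("journal_prompts.jsonl", "rb"),
                                   purpose="fine-tune")

# 2. Fine-tune the "ada" base model on those prompts
job = openai.FineTune.create(training_file=training_file.id, model="ada")

# 3. Once the job has finished, ask the fine-tuned model to complete a title
#    with newline-delimited ASJC codes
completion = openai.Completion.create(
    model=job.fine_tuned_model,  # populated after the fine-tune job succeeds
    prompt="Lower Middle Ordovician carbon and oxygen ...",
    max_tokens=10,
    temperature=0,
)
predicted_codes = completion.choices[0].text.strip().split("\n")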

The validation dataset was run against the model and yielded a micro F1 score of 0.69. So, the tally now is:

Method                                      F1 Score
Tf-idf + multilabel binarizer               0.73
SciBERT embedding + multilabel binarizer    0.71
ChatGPT + sentence completion               0.69

Summary

So, sad trombone: using three different methods, the F1 score is similar across all of them. Essentially, we needed more data for more accurate predictions. Crossref has abstracts for only a subset of the deposited publication metadata, so this data could not be used for comparison at this time. However, having that data could possibly yield better results. The only way to get it would be to use a similar automated method, which we do not currently have, and so, for now, it becomes a chicken-and-egg thought exercise. Getting even more data, such as full text, could also produce interesting results, but we do not have the data for that either. For now, Crossref decided to remove the existing subject classifications that were present in some of our metadata. We could revisit the problem later - if we have more data. There are certainly interesting applications of these methods. We could:

  1. Look into topic clustering across our metadata and see what surfaces. This could also have applications in looking at the research zeitgeist across various time periods.
  2. Measure the similarities of embeddings with each other to look at article similarities, which could yield interesting results in recommendations and search.

Automated subject classification also raises questions about fairness and bias in its algorithms and training and validation data. It would also be productive to clearly understand how the algorithm reaches its conclusions. Therefore, any automated system must be thoroughly tested, and anyone using it should have a very good understanding of what is happening within the algorithm.

This was an interesting exercise for the author to get acquainted with machine learning and become familiar with some of the available techniques.

\",\"tags\":[\"Original Research\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/8d7emer-rm2pg72\",\"short_id\":\"j3ejlwdp\",\"url\":\"https://upstream.force11.org/elife-reviewed-preprints-interview-with-fiona-hutton/\",\"title\":\"eLife Reviewed Preprints: Interview with Fiona Hutton\",\"summary\":\"In October, the journal eLife announced that it will change how it handles peer review starting January 2023:From next year, eLife is eliminating accept/reject decisions after peer review, instead focusing on...\",\"date_published\":\"2022-11-15T10:29:38Z\",\"date_modified\":\"2023-01-11T22:58:48Z\",\"authors\":[{\"url\":\"https://orcid.org/0000-0002-9707-2988\",\"name\":\"Fiona Hutton\"},{\"url\":\"https://orcid.org/0000-0003-1419-2405\",\"name\":\"Martin Fenner\"}],\"image\":\"https://upstream.force11.org/content/images/2022/11/fiona_hutton-2.jpg\",\"content_html\":\"

In October, the journal eLife announced that it will change how it handles peer review starting January 2023:

From next year, eLife is eliminating accept/reject decisions after peer review, instead focusing on public reviews and assessments of preprints.

To better understand what this change means for authors and reviewers, Upstream editor Martin Fenner asked Fiona Hutton, eLife's Head of Publishing, a few questions.

Can you briefly explain eLife’s new publishing model announced in October? eLife will be posting Reviewed Preprints (submitted preprints, alongside public peer review and an eLife assessment) on the eLife website; these Reviewed Preprints will contain a DOI and eLife citation. The eLife assessment details the reviewers’ and editors’ thoughts on the significance of the research and the strength of the evidence supporting the paper’s conclusions. These assessments will use common terminology that will be consistent for all articles in the journal that are assessed, and the authors will be able to include a response to the assessment and reviews. We are also removing the binary accept/reject decision after peer review, with the focus now being on the public peer review and assessment of individual manuscripts. The author can then choose to revise their paper and receive updated reviews and/or an updated assessment, or choose to make their article the final version (Version of Record) with the original reviews and assessment. Authors will go through the new model in early 2023, and the publication fee will be reduced from $3,000 to $2,000.

What has been the feedback so far? We have had a huge amount of positive support for this model, from authors, funders, institutions and open science advocates. Many believe that although many organisations signed DORA, there has been a lack of movement and innovation in making those promises a reality and they see what we are doing as ground-breaking. Many are frustrated at the inertia in the current system and, at almost every open science event, everyone says the system is broken and we need an alternative. What we are doing at eLife is creating an alternative publishing model and providing an alternative output that can be used in research assessment. We are convinced that others will take up this model over time. Of course, we also know the current system is ingrained and that change can be difficult and cannot happen without strong support – we are in the fortunate position where our board is made up of funders that want the system to change and support eLife to lead that change. We also acknowledge that some people do not like or do not agree with the model and do not see the benefit of it. Instead of making judgements about a research paper based on the journal it was published in, we are asking the community to consider the substance of our reviews and editorial judgement, and that is a considerable step-change. It is the responsibility of eLife to show that the model can work and that there is a huge amount of value in it – and that’s where our priorities will be over the coming months.

What are the main problems you want to solve with this change? No one that supports science as an endeavour (including journal publishers), thinks that where something gets published is more important than what has been published, especially when getting into particular journals can often be the result of bias or chance. But where you get published has a huge influence on your academic research career and although there are a number of initiatives to try and address this problem, journal titles are often the quickest route to judge a candidate as part of any research assessment. If eLife can provide an alternate route so that the research is assessed on its individual merit (via a Reviewed Preprint and eLife assessment), rather than where it is published, and that output can be used as currency in research assessment, then we can focus on the production of genuinely impactful science. But there are other benefits – the existing publishing system is slow and although dissemination can be quick via preprints, the review of those preprints and subsequent acceptance into a journal can take time if subject to multiple rounds of revision. By quickly reviewing preprints, we can help researchers who require rapid evidence of their recent work in job, grant, or award applications. We can also provide rapid constructive assessments of that work for other researchers and readers. Crucially, by getting rid of that accept/reject decision, we are decoupling the scrutiny of the review process from the ability to get published. In doing so, we can stop turning recommendations from peer reviewers into requirements that authors must comply with in order to get accepted – a process that can also remove ideas and insights from scientific discourse. Our aim is to expose the nuanced and multidimensional nature of peer review, rather than reduce its function and output to a binary accept/reject decision.

What has been your experience only reviewing preprints since 2021? Very positive. In 2021, eLife began only reviewing preprints and asking our reviewers to write public versions of their peer reviews containing observations useful to readers. We have posted eLife reviews of more than 2,200 preprints to bioRxiv and medRxiv, along with a compact editorial assessment. We found that there were no changes in gender or geographic author demographic and it had no significant effect on reviewer recruitment and engagement. This initiative was the first step in the move to our new model and helped us develop the detail.

What proportion of submissions will be sent to external reviewers, and what will be the criteria? The criteria will change from 'is this suitable for publication in eLife' to 'is this an interesting article worth reviewing?'. Our purpose is to constructively review articles rather than fit them into a hierarchical journal construct. Our capacity, however, is limited by the size and interests of our editorial board, so the number of articles sent to review will naturally be limited. Currently, we send 30% of submissions to review, but we assume this will change with the new model depending on how authors respond.

Will this new policy change the article types (https://reviewer.elifesciences.org/author-guide/types) that eLife publishes? No, we will continue to offer different article types.

Is content published as a Reviewed Preprint different from a Research Article? No, the content is the same. In this system, our editors and peer reviewers review the preprint, which is the version of the manuscript that has been deposited in a preprint server. The Reviewed Preprint will contain the public peer reviews and eLife assessment and will have an eLife citation and DOI. When our authors get to the end of the reviewing process and decide they want a final version (Version of Record), we carry out a series of additional checks to ensure that the data, methods, and code are made available, that appropriate reporting standards have been followed, that competing interest and ethics statements are complete, and that cell lines have been authenticated. It is the final Version of Record that we send to the indexers.

How is the new publishing model similar to or different from older publishing models based on preprints combined with peer review (e.g. Copernicus, F1000)? There are three main differences. 1) Peer review and assessment at eLife continues to be organised by an editorial team made up of academic experts and led by an Editor-in-Chief, Deputy Editors, Senior Editors, and a Board of Reviewing Editors via a consultative peer-review model already known as one of the most constructive for authors in the industry. 2) The addition of an eLife assessment is a further crucial part of our model, distinctive from what others are doing – it is a key addition to our public peer reviews and it enables readers to understand the context of the work, the significance of the research and the strength of the evidence. 3) We are no longer making accept/reject decisions based on peer review – authors will choose if and when to produce a Version of Record at any point following the review process.

What is your role at eLife? My role is Head of Publishing. I manage the four publishing departments – Editorial, Production, Journal Development and Features – which function together to operate and publish the eLife journal. I also work with the technology, marketing, and communities teams to oversee the ongoing transition to the new model and work to drive innovation in publishing and in open initiatives. A further part of my role is to work with funders, libraries, external partners, and organisations in the open science community to ensure we strive in our mission to create real change in the practice of science and a future where a diverse, global community of researchers shares open results for the benefit of all.

\",\"tags\":[\"Interviews\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/zwm7q-vet94\",\"short_id\":\"1jdkooe5\",\"url\":\"https://upstream.force11.org/the-research-software-alliance-resa/\",\"title\":\"The Research Software Alliance (ReSA)\",\"summary\":\"Research software is a key part of most research today. As University of Manchester Professor Carole Goble has said, \\\"software is the ubiquitous instrument of science.\\\" Creating and maintaining research...\",\"date_published\":\"2023-04-18T10:36:52Z\",\"date_modified\":\"2023-04-18T10:36:52Z\",\"authors\":[{\"url\":\"https://orcid.org/0000-0001-5934-7525\",\"name\":\"Daniel S. Katz\"},{\"url\":null,\"name\":\"Michelle Barker\"}],\"image\":\"https://upstream.force11.org/content/images/2023/04/resa-logo-7.png\",\"content_html\":\"

Research software is a key part of most research today. As University of Manchester Professor Carole Goble has said, \\\"software is the ubiquitous instrument of science.\\\" Creating and maintaining research software is a human effort, and as Yanina Bellina, rOpenSci Community Manager, has said, \\\"the work of the people who develop and maintain research software is often hidden and needs to be recognized.\\\"

Studies of researchers at leading UK universities [1] and of postdoctoral researchers in the US [2] have found that over 90% use research software, over 65% see research software as fundamental to their research, and 25-55% develop software as part of their research.

Research software, being ubiquitous, is difficult to define concisely. One recent attempt [3] proposes a working definition (see the figure below).

Figure from https://doi.org/10.54900/9akm9y5-5ject5y

Once research software is developed, however, users find bugs, users request new features, and the underlying hardware and software on which it is built (e.g., libraries, operating systems) change. The research software has to be actively maintained in response to these issues. In other words, research software is not a one-time investment but requires ongoing maintenance.

Research software sustainability, then, occurs when the resources (people, funding, etc.) needed to enable this ongoing maintenance are gathered and applied.

This challenge differs from those facing publications and datasets: while research software can similarly be a research result and a research product, its properties differ from those of papers and datasets, and it therefore needs different policies and funding models.

Research Software Alliance (ReSA)

In response to these facts, the Research Software Alliance (ReSA) was founded in 2019 by the Australian Research Data Commons (ARDC), the Digital Research Alliance of Canada, the National Center for Supercomputing Applications (NCSA), the Netherlands eScience Center, and the Software Sustainability Institute (SSI) to help these organisations and other stakeholders collaborate on the advancement of the research software ecosystem.

ReSA's vision is that research software and those who develop and maintain it are recognised and valued as fundamental and vital to research worldwide, while its mission is to advance the research software ecosystem by collaborating with key influencers and decision makers.

ReSA is a fiscally sponsored project of Code for Science & Society. It has part-time staff members, including a Director, Dr Michelle Barker, and Community Managers located in Africa, Asia, Australia and Canada, and is led by a steering committee chaired by Daniel S. Katz, Chief Scientist at NCSA at the University of Illinois Urbana-Champaign. ReSA has received support totalling USD $663,500 in cash and USD $344,000 in in-kind contributions from its sponsors, Founding Members, and Organisational Members (details are publicly available).

Activities

ReSA accomplishes work primarily through forums and task forces.

Funders Forum

The ReSA Funders Forum, a collaboration of 30+ research software funding organisations, started in 2022. The organisations represented are currently about 2/3 government, 1/4 philanthropic, and 1/12 industry. Geographically, 41% are in Europe, 34% in North America, 17% in Australasia, and 4% each in Africa and South America.

The Funders Forum provides a formal mechanism for funders to share funding practices, address research software community challenges, facilitate networks and collaboration, and consider how to achieve long-term sustainability for research software. It meets monthly; each meeting runs twice in a day so that participants in different time zones can join, and the meeting times shift from month to month so that different sets of participants can take part. Each meeting includes short presentations from funders on their new activities and a discussion topic led either by ReSA or by a funder, on subjects such as diversity, equity, and inclusion in research software, open source program offices, collaboration mechanisms, and landscape analysis. The Funders Forum also has working groups in which more focused activities can occur, such as planning for a multi-organisational funding call and defining policies around FAIR for research software.

Discussions in the ReSA steering committee and the funders forum led ReSA and the Netherlands eScience Center to co-convene the “International Funders Workshop: The Future of Research Software” in Amsterdam in November 2022, with representation from 45 funding organisations (those who provide monetary and/or in-kind support to research software and/or the people who develop and maintain it).

Photo via Flickr. Credit: Annelies Verhelst.

In this workshop, attendees discussed a draft of the Amsterdam Declaration on Funding Research Software Sustainability [4], which continues to be open for consultation.

Photo via Flickr. Credit: Annelies Verhelst.

Community Forum

ReSA hosts an occasional online community forum for the global research software community, as an opportunity for participants to meet and share information. Each call features a short talk and follow-up discussion, with the aim of facilitating community consideration about what’s needed next to address particular issues. The Community Forum is open to everyone. Meetings occur at alternating times to maximise attendance by participants in different time zones. Meeting topics have included:

Task Forces

In addition, ReSA participates in ad hoc activities that support its mission, such as sharing diversity, equity, and inclusion best practices at the 2022 Vive la difference - research software engineers hybrid workshop [5].

Participate

ReSA provides freely available resources that anyone can use to raise awareness of the importance of research software, as well as a database of research software funding opportunities.

Anyone can sign up to receive updates on ReSA through its regular newsletter, provide information on new funding calls to our database, join task forces focused on specific activities, and join the ReSA Slack to share what is happening in their community.

Anyone who represents a funding organisation and wants to interact with other funders can join the (free) Research Software Funders Forum.

Finally, ReSA is always looking for organisations who want to become recognised as organisational members in order to financially support its work.

References

  1. It’s impossible to conduct research without software, say 7 out of 10 UK researchers. Accessed April 13, 2023. https://www.software.ac.uk/blog/2014-12-04-its-impossible-conduct-research-without-software-say-7-out-10-uk-researchers
  2. Nangia U, Katz DS. Track 1 Paper: Surveying the U.S. National Postdoctoral Association Regarding Software Use and Training in Research. Published online August 30, 2017. doi:10.6084/m9.figshare.5328442.v3
  3. Nieuwpoort R van, Katz DS. Defining the roles of research software. Upstream. Published online March 14, 2023. doi:10.54900/9akm9y5-5ject5y
  4. Barker M, Chue Hong NP, Eijnatten J van, Katz DS. Amsterdam Declaration on Funding Research Software Sustainability. Published online March 16, 2023. doi:10.5281/zenodo.7740084
  5. Barker M, Leung MA, Martinez PA, et al. Report on Vive La Différence - Research Software Engineers. Zenodo; 2022. doi:10.5281/zenodo.6859709
  6. Martinez PA. The Research Software Community Landscape in the Global South. Published online October 10, 2022. doi:10.5281/zenodo.7179892
  7. Barker M, Katz DS, Gonzalez-Beltran A. Evidence for the Importance of Research Software. Zenodo; 2020. doi:10.5281/zenodo.3884311
  8. Research Data Alliance. Final release: COVID-19 guidelines. Published online 2020. doi:10.15497/RDA00052
  9. Chue Hong NP, Katz DS, Barker M, et al. FAIR Principles for Research Software (FAIR4RS Principles). Published online 2021. doi:10.15497/RDA00068
  10. Barker M. FAIR4RS Roadmap Report. Published online February 20, 2022. doi:10.5281/zenodo.6239373
  11. Katz DS, Barker M, Martinez PA, Anzt H, Gonzalez-Beltran A, Bakker T. The Research Software Alliance (ReSA) and the community landscape. Published online March 11, 2020. doi:10.5281/zenodo.3699950
\",\"tags\":[\"News\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/bzt33-6b575\",\"short_id\":\"56glr9g9\",\"url\":\"https://upstream.force11.org/opening-new-paths-to-technology-transfer/\",\"title\":\"Opening New Paths to Technology Transfer: Open Science, Intellectual Property, and Technology Transfer\",\"summary\":\"I help Canadian neuroscience research institutes create and adopt an institute-level approach to open science. Inevitably, I end up talking to researchers, administrators, academic commercialization offices,...\",\"date_published\":\"2023-05-09T08:03:02Z\",\"date_modified\":\"2023-05-09T08:21:27Z\",\"authors\":[{\"url\":null,\"name\":\"Dylan Roskams-Edris\"}],\"image\":\"https://images.unsplash.com/photo-1512314889357-e157c22f938d?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGlkZWF8ZW58MHx8fHwxNjgzNTMyNjc1&ixlib=rb-4.0.3&q=80&w=2000\",\"content_html\":\"

I help Canadian neuroscience research institutes create and adopt an institute-level approach to open science. Inevitably, I end up talking to researchers, administrators, academic commercialization offices, and businesses about open science, intellectual property (IP), and technology transfer. I’ve written this blog post to highlight some examples of what is possible outside of the standard approach in the hope that it gets some people thinking about the much broader horizon of technology transfer that could exist.

These conversations can become a little… tense. The fact of the matter is that the value of freedom that is central to open science (i.e. freedom to access, freedom to use, freedom to adapt and remix) is in direct tension with IP. The basic purpose of IP is to take the most important right from standard property law–the right to exclude others from accessing or using something you own–and import it into the world of information.

This type of exclusion is not a natural state of affairs for information. Consider a hammer, which can only be used by one person at a time and will inevitably degrade with continued use. The ability to exclude others from using the hammer is important for its continued usefulness. The information about the hammer–what it’s made of, how to make and assemble the components–on the other hand, can be used by any number of people simultaneously and its usefulness tends to improve over time as more people use and build on it. With this reality in mind, one would think academic institutions, which exist to disseminate high quality information as broadly as possible, would shy away from the IP and embrace open science. So why are they so fascinated with IP?

Why is Academia so Fascinated with IP?

All sorts of useful information comes out of academic research: not just basic knowledge about the universe, but also ways of applying that knowledge to create new tools or new ways of doing things. When knowledge is applied in this way, we call the result an “invention.”

In a market society the development and dissemination of inventions is primarily handled by “market actors,” like companies. Universities (at least the non-profit ones) are not supposed to be market actors. They are funded primarily by governments to do things fundamental to the project of humanity that a market usually isn’t interested in, such as providing a well-rounded education and conducting research into the fundamental nature of the universe.

The problem of technology transfer is to take the inventions arising out of the non-market realm of academia and transfer them to the market. For decades the solution has been IP. The logic is that one needs the exclusive guarantee provided by IP before putting in the money needed to develop prototypes, organize manufacturing, obtain regulatory approvals, advertise, handle sales, and disseminate the invention to buyers. Therefore, the thinking goes, one should turn information into IP (e.g. by obtaining a patent) to secure the (temporary) right to exclusive use of that information in the market and (at least theoretically) make money.

It is possible that in the past the trade-off of encumbering information in order to transfer it to the market was the only viable option. Maybe in certain circumstances that is still the case. However, the rise of the internet for distributing information, the spread of distributed manufacturing technologies, the growth of communities that can work together on problems, and a rising social philosophy advocating for inclusion over exclusion all force us to question whether, in most cases, it is still the best option.

Transferring Technologies

I’ve gone through many cycles of opinion about IP and its place in academia, from praising it as an ideal tool to cursing it as a dam on the flow of information. After much consideration, the conclusion I’ve reached is that the greatest sin of the modern model of technology transfer’s reliance on IP is the rather humdrum one of narrow-mindedness. Concentrating on IP cuts off so many possibilities.

Using IP to Transfer Technology

Let’s begin with an overview of what the standard “IP to transfer” process looks like, then move on to some examples of how it could work in the open.

Because researchers are a naturally creative bunch, it is no surprise that they often come up with new tools, useful bits of software, or new methods for producing results. It is also unsurprising that, upon inventing something, they want others to use it. Under the traditional mode of technology transfer, this is where the snags start. If you want to turn your invention into a product others can buy, you have to:[1]

  1. Declare it to your institution’s Technology Transfer Office (“TTO”) and wait until they get back to you. This usually takes somewhere on the order of weeks.
  2. Wait while the office assesses whether it’s the kind of invention worth pursuing (usually involving looking at other existing products or IP and the potential market). This is usually a weeks- or months-long process.
  3. If they do want to pursue it, then going through the process of writing a patent application.[2] This can easily be a months-long affair. If they don’t want to pursue it, then deciding whether to go through that process yourself.[3]
  4. Filing the patent application. This can be relatively quick, though the journey between a patent application and a granted patent can be long (averaging almost two years in the USA) and get quite expensive.
  5. Based on the patent application, looking for money to start a company or licensing the technology to an existing company. There is really no way of putting a good time frame on this. If you already know a potential investor, can self finance, or you are willing to take out a loan and find a bank to give you one, it could be less than a year. If you don’t have any of the above, you could be waiting for years or never succeed.
  6. Assuming you do start a company or licence the invention, then further development begins. Success in actually creating and selling something relies on finding the right people to hire, avoiding cashflow problems, advertising effectively, obtaining relevant regulatory approval, suing or getting sued by competitors alleging IP infringement, and the panoply of market vagaries that can spell the end.

The above can be a lengthy process, during which time there is no guarantee of success. Moreover, it is best if you don’t tell anyone too much about the invention (unless they sign a Non-Disclosure Agreement) because the more information potential competitors have the more time they have to invent a competing product, challenge your patent application, rush to file a competing patent before you, and myriad other cunning business practices.

Facing all of the above, many give up somewhere along the line. All they wanted was to make and share useful things and going through the rigamarole outlined above is simply too much. If that’s you, or someone you know, I want you to know that there are alternatives.

Transferring Technology in the Open

What most folks don’t know is that there is another option: share the information concerning how to make and use the invention with the potential user community and don’t bother spending the time and money to get IP (in fact, take some simple steps to make sure IP doesn’t get in the way). Then you can use non-IP dependent business models to continue development. Let’s look at a few kinds of invention where this path has proven to be possible.

Hardware and Equipment

When you are making physical tools this approach is known as “open hardware”. There are numerous examples, including the Miniscope,[4] OpenBCI,[5] OpenTrons,[6] the Glia Project,[7] OpenFlexure,[8] and FarmBot.[9]

Open hardware is made possible by technologies of information distribution, like sharing equipment designs and parts-lists via GitHub, and distributed manufacturing, like 3D printing. Using these technologies, inventions can be effectively “transferred” from creator to user at a fraction of the time and cost of the standard approach. Depending on the model, an OpenTrons automatic pipetting machine, for example, can cost approximately $10,000 USD, a fraction of the $50,000 to $100,000 price tag for similar, proprietary competitors.

The open model brings the added benefit of enabling the creation of a community of users who can help in product development. If a user finds a way of improving your product, they can inform you of it (e.g., by submitting a new design) and, if you like it, that design can be incorporated. As long as the “no one is taking any IP in any of this” approach is clearly defined through a mission statement, contributor guidelines, and appropriate open licenses (e.g., the CERN Open Hardware License), you can eliminate the headache and delay usually inherent in obtaining and maintaining IP.

Software

TTOs also handle software. Some of the most valuable inventions these days are digital: a piece of software that makes a manufacturing process more efficient or detects a key disease biomarker can have serious value. What TTOs don’t do is provide any help with transferring software on an open basis. The option presented is to wade into the confusing, uncertain, and frustrating world of software patents.

Any piece of useful software released on an open-source basis can, by definition, be downloaded, used, and modified by anyone for free. Contrary to what you might think, the open source software industry is bigger than it has ever been. The success of Red Hat in distributing Linux, or of WordPress in taking over the blogging world, is difficult to deny.

Does it Really Work?

“But if anyone can use it, and you can’t stop them with IP, how do you make money?” There are a variety of answers to this.

In the open hardware case, you can sell kits with all of the necessary parts collected together to make assembly easier; you can sell warranties and support in case something breaks; you can sell fully assembled units to people who don’t have the time, resources, expertise, or inclination to build it themselves; you can sell customization services in case a user needs a slightly altered version; and you can sell training courses.

The above is just a small slice of the kinds of “accessory services” that can be sold independent of IP protection.

Similarly for open source software, you can sell customization services, or customer support, or warranties, or training programs, or employ an Open Software as a Service[10] or Infrastructure as a Service[11] business model. The number of viable open-source business models is growing all the time.

How grand would it be for universities to have a pathway for their researchers to design and disseminate open hardware and open source software and, if appropriate, create successful ventures that offer the kinds of accessory services users desire?

Unfortunately, that pathway does not yet exist. The issue is that universities have not created a supported, open path (if they are even aware that such a path is possible). It is not that academic institutions are hostile to the idea (usually), but that they continue their narrow-minded concentration on an IP-heavy path that is becoming less and less necessary.

New Paths (and the Need for More of Them)

The analogy I often trot out is that a tomato is open. Every tomato you buy comes loaded with seeds within which reside the entire assembly instructions to manufacture more tomatoes. Yet, instead of people putting tomato growers out of business by growing their own, people return to the store time and again. The fact is, most people do not have the time, resources, expertise, or inclination to grow something when they can buy it. The same is true of many existing and potential products, and the software and hardware examples I’ve discussed here are just the tip of the iceberg.

The open paths sketched above should be available to researchers. It is extremely likely that inventions that don’t fit the narrow success criteria of the IP-heavy tech transfer approach could still be transferred successfully along these open paths.

There are, unfortunately, some significant barriers. University IP and Innovation Policies usually have absolutely nothing to say about open pathways, and in some cases are actively harmful to those who want to try them. I have found it consistently impossible to get a straight answer from TTOs about whether the university will try to make a claim to a portion of revenue if a researcher tries to use one of these models. The reason, I believe, isn’t that they want to be difficult, but that under current policies there is no clear answer. These policies are in desperate need of being updated. Further, TTOs lack the expertise and relationships necessary to take advantage of open transfer pathways. I would bet a healthy sum that any 10 TTOs chosen at random would not be able to offer much useful advice when it comes to selecting a distribution platform, choosing an open license, or cultivating a user community. Those topics simply don’t apply within the narrow lane they’ve set for themselves.

One thing that gives me hope is that if you look at the website of most TTOs and navigate to their mission statement you likely won’t find anything about patents or other forms of IP. What you will find is that they exist to help make sure that inventions are transferred out into the world in a way that produces economic and social good. I think that is what we all want. So, when I meet with technology transfer officers, entrepreneurial researchers, or institutional leadership that is where I start. My job then is simply to convince them that establishing open paths should be added to their toolkit. Doing so can often be quite fun! Hopefully this post helps at least some of you do the same.

  1. This is all assuming the standard model at play in North America, though many European colleagues I’ve talked to operate under similar regimes. ↩︎
  2. To be fair, I am simplifying here. Institutions can and often do decide to file something called a provisional patent, which takes less time. Provisional patents, however, are essentially short-term placeholders that won’t provide full patent protection unless converted into a proper patent application. It all gets a bit hairy and into the IP weeds, hence my attempt to spare you through the simplification. ↩︎
  3. Which means you also have to pay for it yourself which, given the fees of IP professionals, can be quite expensive (think at least $3000 just to write and file, often much, much more). That doesn’t include the costs of shepherding the patent through the review and revision process (what’s known somewhat confusingly as patent prosecution) which often adds many more thousands or tens of thousands of dollars. ↩︎
  4. A microscope for fluorescence imaging. ↩︎
  5. Electroencephalography devices. ↩︎
  6. Automated pipetting machines. ↩︎
  7. Medical equipment like tourniquets and stethoscopes. ↩︎
  8. A microscope. ↩︎
  9. A robotic farming system. ↩︎
  10. This is how Wordpress works. ↩︎
  11. If you’re interested, check out Linode. ↩︎
\\n \",\"tags\":[\"Thought Pieces\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/ebw4ce4-47p2tbr\",\"short_id\":\"q0dqz9do\",\"url\":\"https://upstream.force11.org/setting-the-scene-ai/\",\"title\":\"Setting the Scene: How Artificial Intelligence is reshaping how we consume and deliver research\",\"summary\":\"Since its release towards the end of 2022, ChatGPT has been dominating the majority of AI-related conversations on social media. One could almost say it has made AI more mainstream and accessible than ever....\",\"date_published\":\"2023-01-24T13:18:29Z\",\"date_modified\":\"2023-01-30T08:33:28Z\",\"authors\":[{\"url\":null,\"name\":\"Saikiran Chandha\"},{\"url\":null,\"name\":\"Sucheth R\"},{\"url\":null,\"name\":\"Tirthankar Ghosal\"}],\"image\":\"https://upstream.force11.org/content/images/2023/01/Untitled-4.png\",\"content_html\":\"

Since its release towards the end of 2022, ChatGPT has been dominating the majority of AI-related conversations on social media. One could almost say it has made AI more mainstream and accessible than ever. AI is quickly revolutionizing the modern-day research landscape. According to a CSIRO report, nearly 98% of scientific fields use AI in some way. The possibilities are endless, with these state-of-the-art AI technologies becoming more and more accessible.

AI tools are gradually making inroads into the research ecosystem. From breaking down research papers to make them more comprehensible and auto-completing academic essays to accurately predicting 3D models of protein structures, AI is streamlining multiple aspects of scholarly pursuit. In short, it can dramatically reduce the time researchers invest in routine tasks, giving them more time to think, focus on data and analysis, synthesize their thoughts, and make inferences.

This blog post is part of a series on how AI is reshaping the research landscape. In the first part, we will set the scene by examining the different ways AI applications are currently used in consuming and delivering research.

1. Knowledge discovery: Getting through papers faster

Gathering valuable insights from the sea of scientific manuscripts can be a daunting task. Given the sheer number of papers published each year — with close to 2.4 million annually — finding relevant papers and distilling critical insights from them is almost like finding a needle in a haystack.

It's a challenging feat for any researcher as there will always be unfamiliar terms, concepts, theories, and equations to cross-reference to understand the paper thoroughly. Plus, there may be questions that one would have to look up separately while making the connections between concepts. The difficulty further increases if you are a non-English speaker since three-quarters of science and humanities papers are written in English.

Fortunately, we now have AI-powered research reading tools that can help us navigate the vast array of papers and make sense of their content. SciSpace Copilot is one such tool. It helps you read and understand articles by providing explanations of scientific text and math, and it supports follow-up questions for more detailed answers in multiple languages. Elicit allows researchers to access relevant papers with summarized takeaways. System is an open data resource that combines peer-reviewed articles, datasets, and models to help you understand the relationship between any two things in the world.

The list is growing, with more coming up every day. These tools aim to help researchers and science practitioners extract critical information and required context from research papers faster.

2. Communication enhancement: Articulating yourself better

Writing grant applications can take up a substantial amount of time, even for the most accomplished researchers. Some report that up to 50% of their time is dedicated to this process. On top of this, you have papers, emails, conference presentations, and even social media posts to write to disseminate your findings and make your research visible. While this is an important activity for advancing research, it is eating into the time you would spend refining your research and honing your analysis.

Generative AI model-based writing tools are tackling this challenge. A researcher used GPT-3, a large language model, to write an academic paper — complete with references. While it is probably not a good idea to use AI to write the whole piece, one can use it to bootstrap, explore different angles, and improve the content's tone, structure, and flow.
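
As a sketch of how this can work in practice, the snippet below sends a draft paragraph to a hosted large language model and asks for suggestions on tone, structure, and flow rather than for finished text. It assumes access to OpenAI's chat completions endpoint, a model named gpt-3.5-turbo, and an API key in the OPENAI_API_KEY environment variable; any comparable service could be substituted, and the draft text is invented for illustration.

import os
import requests

# Draft text invented for this example.
draft = (
    "Our results shows that the proposed method outperform the baseline "
    "on all three datasets, which is a significant achievement."
)

prompt = (
    "You are helping a researcher polish a manuscript. Suggest improvements "
    "to the tone, structure, and flow of the following paragraph without "
    "changing its claims, and briefly explain each suggestion:\n\n" + draft
)

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-3.5-turbo",  # assumed model name; use whatever chat model is available
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,  # keep the suggestions conservative
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])  # review the suggestions, then revise by hand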

Lex is a word processor, similar to Google Docs, with AI assistance built in. Writefull X is an AI-powered writing application tailored to academia. Both help lighten the load, allowing you to focus on sharing your research findings rather than stressing about the actual writing.

3. Data analysis acceleration: Making sense of data faster

You can only analyze data once it has been cleaned and organized. That often means spending hours manually sorting and categorizing your data, which can be tedious, especially when dealing with large volumes of unprocessed data. On top of that, you might have to learn to use spreadsheet software and databases and, in some cases, coding languages like Python or R.

Thankfully, advancements in AI have made it possible to make sense of data faster and with less human effort. A wide range of AI tools that are currently available could help you each step of the way, from data extraction to data visualization and even predictive analysis.

Start with AI-based spreadsheet bots that turn your natural-language instructions into a spreadsheet formula. Suppose you want to find the total number of survey respondents in the 16-25 age bracket who answered 'yes' to a question. You could type that request (along with the relevant column references), and the spreadsheet bot will create the formula that gives you the answer you need. If you want to visualize the data, you have platforms like Olli that help you create line charts, bar graphs, and scatter plots by simply describing what you want.
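
To make the spreadsheet example concrete, here is what that query looks like when written by hand, shown as a minimal Python/pandas sketch; the column names ("age", "answer") and the tiny data set are made up for illustration, and the formula in the final comment is just one plausible output such a bot might generate.

import pandas as pd

# Hypothetical survey responses; the column names are assumptions for this sketch.
survey = pd.DataFrame({
    "age": [19, 42, 23, 17, 31, 25],
    "answer": ["yes", "no", "yes", "yes", "yes", "no"],
})

# Count respondents aged 16-25 who answered 'yes'.
mask = survey["age"].between(16, 25) & (survey["answer"] == "yes")
print(int(mask.sum()))  # prints 3 for this toy data

# One plausible spreadsheet formula a bot might generate (columns assumed):
# =COUNTIFS(B:B, "yes", C:C, ">=16", C:C, "<=25")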

It doesn't end there. OpenAI Codex is an AI model that translates natural language into code. This has been used to build GitHub's AI coding assistant, which gives you code suggestions in real-time, right from your editor. An MIT-based study revealed that you could use this model to solve university-level math problems consistently.

There are also AI-driven data analysis tools out there, like Lookup. You can upload the data, ask questions in plain English, and get answers quickly without learning complicated query languages or figuring out how various tables connect.

4. Publishing efficiency: Expediting the workflow

Getting a scholarly manuscript published is, again, a tedious process, with formatting, editing, proofreading, and the all-important peer-review cycle. On the one hand, you have authors spending 52 hours a year on average on formatting. On the other, journals reject around 75% of manuscript submissions before they even reach the peer review stage. These numbers indicate that there is room for improvement in the publishing workflow.

The integration of AI tools by both authors and publishers can streamline this process. On the author's side, AI-based solutions like Grammarly, Lex, Turnitin, and Writefull help automate formatting, referencing, plagiarism checking, and grammar checks.

Journal publishers are also turning to AI to streamline the review process. For instance, the American Association for Cancer Research (AACR) uses Proofig to verify the authenticity of images in submissions sent to its journals. Springer Nature adopted UNSILO, an AI-based platform, to identify links across eleven million published journal articles, enabling them to find related articles quickly. Penelope.ai is another AI-based tool that helps ensure that manuscripts meet a journal's requirements by quickly analyzing references and structure. AI is also being used for fact-checking. The potential for AI to optimize the journal publishing process is immense.

Final thoughts

AI models hold tremendous potential for the scientific research community. At the same time, there are serious concerns about employing such technology, ranging from plagiarism and the replication of human biases to the spread of false information and other ethical violations. Research teams and other stakeholders must join forces to guarantee that AI-driven research systems are responsibly constructed and used.

AI is still evolving, and expecting it to always produce reliable results is unrealistic. After all, it has only been around five years since the release of "Attention Is All You Need", the groundbreaking paper that introduced the Transformer — an NLP model considered the foundation of many of today's AI models. Fortunately, the early signs of progress are encouraging, and continued developments are anticipated. We can expect better generation capability and factual consistency from large language models in the near future.

Even so, AI can still produce inaccurate output. So, when employing AI in your workflow, be sure to double-check all outputs before relying on them.

In the next edition of this series, we will look at how AI is helping researchers overcome language barriers. Stay tuned! Thanks for taking the time to read this post. Please feel free to contact us at saikiran@scispace.com with any questions or thoughts. All the images in this post were created with the text-to-image AI tool DALL·E 2.

\",\"tags\":[\"Thought Pieces\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/c490m3r-r45gr6m\",\"short_id\":\"pkgr54e6\",\"url\":\"https://upstream.force11.org/preprints-and-open-preprint-review-a-workshop/\",\"title\":\"Preprints and open preprint review: a workshop on innovations in scholarly publishing\",\"summary\":\"Researchers, librarians, policy makers, and practitioners often complain about the scholarly publishing system, but the system also offers exciting opportunities to contribute to innovations in the way...\",\"date_published\":\"2022-10-25T16:21:22Z\",\"date_modified\":\"2023-03-16T18:57:34Z\",\"authors\":[{\"url\":\"https://orcid.org/0000-0002-5965-6560\",\"name\":\"Bianca Kramer\"},{\"url\":\"https://orcid.org/0000-0001-8249-1752\",\"name\":\"Ludo Waltman\"},{\"url\":null,\"name\":\"Jeroen Sondervan\"},{\"url\":null,\"name\":\"Jeroen Bosman\"}],\"image\":\"https://upstream.force11.org/content/images/2022/10/Overview-preprint-servers.png\",\"content_html\":\"

Researchers, librarians, policy makers, and practitioners often complain about the scholarly publishing system, but the system also offers exciting opportunities to contribute to innovations in the way academic findings are disseminated and evaluated. At the Dutch Open Science Festival, which took place at the Vrije Universiteit Amsterdam on September 1st 2022, we organized one of the ‘community-led’ workshops to discuss some of these developments, focusing on preprints and open preprint review. Participants discussed the opportunities offered by these innovations, and reflected on ways in which these innovations may complement, or perhaps even replace, traditional journal publishing practices.

Preprints

An important development in scholarly publishing is the increasing adoption of preprints as a way to accelerate the dissemination of academic findings^1^. Preprinting is a fairly established practice in fields such as physics, mathematics and computing science (e.g. arXiv), and also in the form of working papers in some fields, such as economics. In recent years it has also gained significant popularity in biomedical fields (e.g. bioRxiv and medRxiv) and in the social and behavioral sciences (e.g. SocArXiv and PsyArXiv). In addition to discipline-specific preprint servers, discipline-independent servers (e.g. OSF Preprints, preprints.org, Research Square, and SSRN) are also increasingly being used. Some of these servers are non-profit, while others are owned by commercial publishers.

To set the stage, we started by discussing the main characteristics of preprints. Definitions can vary, especially amongst disciplines. But in general, preprints (or “working papers”) are an early version of a paper, chapter or other publication, before formal peer review. They are published online on a preprint server either before or upon submission for more ‘formal’ publication, typically in a peer-reviewed journal. Preprints can have updated/corrected versions, e.g., based on comments from peers or community feedback. Nowadays, almost all journals and publishers allow authors to publish their work as a preprint^2^.

Uptake of COVID-19 preprints per month, starting from February/March 2020 (source: https://github.com/nicholasmfraser/covid19_preprints)

Preprints are used quite widely, and can be positively disruptive to a slow and intransparent scientific communication system. Nevertheless, in many academic fields preprinting is not yet practiced by the majority, despite the ‘Corona-boost’ (see figure), and practices and norms around preprinting are therefore still in development. As a result, there may be uncertainty about the status and acceptable usage of preprints vis-à-vis peer-reviewed articles in journals. Also, the financial sustainability of preprint servers still represents a significant challenge^3^. In addition, while many institutions, funders, and societies are promoting or mandating open access publishing, they usually do not actively encourage preprinting, which may explain the relatively slow uptake of preprints. And last but not least, despite the sanity checks performed by preprint servers, there is a risk of dissemination of pseudoscience. While most preprints present solid research, there are also preprints that report on lower-quality work and that may make unsubstantiated claims.

Preprint peer review

Unlike articles published in scholarly journals, preprints typically have not been peer reviewed. A recent development is the emergence of platforms for open peer review of preprints. These platforms complement the traditional closed journal peer review system. They typically aim to make peer review more transparent and more efficient. Examples of platforms for preprint peer review are Peer Community In, PeerRef, preLights, PREreview, and Review Commons. A crucial characteristic of these peer review platforms is that they are all independent of journals and independent of the preprint platforms.

Future development - three scenarios

The growth in preprinting and the emergence of preprint peer review platforms raise interesting questions about the future development of the scholarly publishing system. While it is too early to make strong predictions, in our workshop we outlined three scenarios for the relationship between preprints, journals, and peer review.

Scenario 1: The mixed system

In this scenario, there is a mixed system in which journals, preprint servers, and peer review platforms co-exist in a loosely coupled way. Journals and preprint servers operate independently from each other, but there can be relations between them, for instance to enable authors to submit their work simultaneously to a journal and to a preprint server^4^. Likewise, there are initiatives for peer review where peer reviews are made available alongside a preprint, while also being used by journals for deciding whether to accept the preprinted article for publication in the journal. Peer review reports (of both journal articles and preprints) can also be published on dedicated platforms, both by journals and by individual reviewers, either invited by journals, by peer review services, by authors, or not invited at all.

Scenario 1: The mixed system


Scenario 2: The extended journal

In this scenario, journals broaden the services they offer to include preprinting and open peer review, leading to what may be referred to as the ‘extended journal’. Preprinting and open peer review become fully integrated elements in the workflows of journals. Scholarly publishing remains organized around journals, but journals start to perform functions that they did not perform traditionally.

Scenario 2: The extended journal

Examples of this can be found at Copernicus Publications, with its Interactive Peer Review, but also at the F1000 platforms, which are currently also white-labelled and used by research funders (e.g. Wellcome Open Research and the European Commission’s Open Research Europe platform). The integration of SSRN and Research Square into the submission workflows of Elsevier and Springer Nature journals is another example of a development in the direction of the extended journal.

Scenario 3: Moving away from journal-based publishing

In this scenario, the scientific community increasingly recognizes the value of preprint servers and peer review platforms. This leads to a situation that may be considered the opposite of the ‘extended journal’ scenario discussed above. Rather than taking on additional functions, journals choose to unbundle their services. The dissemination function of journals is going to be performed by preprint servers (although the term ‘preprint’ may no longer be appropriate), while the evaluation function is going to be performed by peer review platforms. In the most extreme variant of this scenario, journals completely cease to exist and scholarly publishing takes place entirely on preprint servers and peer review platforms.

Scenario 3: Moving away from journal-based publishing

The ‘publish, review, curate’ model promoted by eLife and Peer Community In, and the Notify project of the Confederation of Open Access Repositories (COAR), are important steps in this direction. An important issue that we did not address in the workshop is how such a system without journals would perform the functions of disciplinary community building and topical filtering that journals currently have. By taking advantage of developments in filtering technologies, we expect that preprint servers and peer review platforms will increasingly be able to perform these functions.

Publish Your Reviews

At the end of our workshop, we presented the Publish Your Reviews initiative, a new community-based initiative developed by one of us together with ASAPbio. Building on the above-mentioned developments, Publish Your Reviews encourages researchers to combine journal peer review with preprint peer review, aiming to increase the value of preprints and to make peer review more useful and more efficient. Supported by publishers and other organizations, the initiative calls on researchers to publish the reviews they submit to journals and to link these reviews to the preprint version of the article under review. Researchers that support Publish Your Reviews are invited to sign a pledge.

Next steps

The workshop participants showed great interest in the developments mentioned above. At the end of the workshop, many participants had a concrete plan for contributing to these developments. Some announced that they are going to publish their own reviews, while others plan to promote preprinting and preprint peer review in their communities. There were also participants who are going to consider how preprinting and preprint peer review can be given appropriate recognition in hiring and promotion policies.

Further support for innovating scholarly publishing was given at the end of the Open Science Festival, when Robert Dijkgraaf, Minister of Education, Culture and Science, strongly criticized the current publishing system. We hope the words of the minister will encourage everyone in the Dutch research community and beyond to support efforts to innovate scholarly publishing!

---

References:
1. See for example: Chiarelli, A. et al. (2019). Accelerating scholarly communication: The transformative role of preprints. Zenodo. https://doi.org/10.5281/zenodo.3357727 and Waltman, L., et al.  (2021). Scholarly communication in times of crisis: The response of the scholarly communication system to the COVID-19 pandemic (Version 1). Research on Research Institute. https://doi.org/10.6084/m9.figshare.17125394.v1.

2. The JISC service Sherpa Romeo offers information about journals’ policies on posting different versions of a research article (preprint, postprint and Version of Record).

3. Penfold, N. (2022). The case for supporting open infrastructure for preprints: A preliminary investigation. Zenodo. https://doi.org/10.5281/zenodo.7152735

4. See for example the Direct Transfer Service offered by PLOS https://journals.plos.org/plosone/s/preprints.





For commenting, all slides are available here.

For referencing, all slides are deposited here: https://doi.org/10.5281/zenodo.7040997.

\",\"tags\":[\"News\",\"Meetings\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/fnj5mz2-2g0bj5m\",\"short_id\":\"nodzq4dp\",\"url\":\"https://upstream.force11.org/preprint-as-a-way-to-universal-open-access/\",\"title\":\"Preprint as a Way to Universal Open Access\",\"summary\":\"A journey started with the INArxiv hosted by The Center of Open Science to RINarxiv hosted by BRIN Indonesia.Current situation in academia (especially in Global South). It’s very hard for us to keep our heads...\",\"date_published\":\"2022-09-28T09:31:06Z\",\"date_modified\":\"2023-01-12T19:26:10Z\",\"authors\":[{\"url\":\"https://orcid.org/0000-0002-1526-0863\",\"name\":\"Dasapta Erwin Irawan\"}],\"image\":\"https://upstream.force11.org/content/images/2022/09/Screenshot-2022-09-19-at-17.14.26.png\",\"content_html\":\"
A journey that started with INArxiv, hosted by the Center for Open Science, and continues with RINarxiv, hosted by BRIN Indonesia.
The current situation in academia (especially in the Global South): it’s very hard for us to keep our heads above water.
The basic ingredients of preprinting are ethics, integrity, and norms. We have a generic set of values as well as subject-specific and culture-specific values.
Part 1: Universal open access for authors. No matter which journal you submit your manuscript to, make sure to self-archive the preprint and postprint versions to take back your freedom to write, to share, and to retain copyright.
Part 2: Universal open access for readers. Preprints deliver the rights to search, to read, and to reuse to readers.
To achieve both parts, preprint servers and repositories need to be kept non-profit or run as part of governmental infrastructure.
In the end… how can we create positive dents in the community?
All together!
\",\"tags\":[],\"language\":null},{\"id\":\"https://doi.org/10.54900/fk7p22x-xydnebd\",\"short_id\":\"zjd40xg5\",\"url\":\"https://upstream.force11.org/the-preprint-revolution-implications-for-bibliographic-databases/\",\"title\":\"The preprint revolution - Implications for bibliographic databases\",\"summary\":\"Scholarly publishing is going through an exciting transition. Researchers are increasingly publishing their work on preprint servers before submitting it to journals. Some researchers have even decided not to...\",\"date_published\":\"2023-02-21T08:00:35Z\",\"date_modified\":\"2023-02-27T08:33:59Z\",\"authors\":[{\"url\":\"https://orcid.org/0000-0001-8249-1752\",\"name\":\"Ludo Waltman\"},{\"url\":\"https://orcid.org/0000-0001-8448-4521\",\"name\":\"Nees Jan van Eck\"}],\"image\":\"https://images.unsplash.com/photo-1673124247113-a44d3c6ccbc5?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEzMnx8c3RhcnMlMjBhYnN0cmFjdHxlbnwwfHx8fDE2NzY1NzYyMTU&ixlib=rb-4.0.3&q=80&w=2000\",\"content_html\":\"

Scholarly publishing is going through an exciting transition. Researchers are increasingly publishing their work on preprint servers before submitting it to journals. Some researchers have even decided not to submit to journals at all. On top of this, new forms of peer review are emerging, organized around preprints and operating independently from traditional journal peer review. Journals have started to reconsider their role in the scholarly publishing system, which has for instance led eLife to abandon accept/reject decisions. In addition, a growing number of research funders, including the funders behind Plan S, no longer expect researchers to publish their work in journals, recognizing that peer review can be organized just as well around preprints, without involving journals.

Bibliographic databases play a crucial role in scholarly literature discovery and research evaluation. How do these databases respond to the preprint revolution? The traditional focus of these databases has been on articles published in peer-reviewed journals, but this is gradually changing. For instance, in 2021, Scopus announced that it had started to index preprints. And earlier this month, Web of Science announced the launch of its Preprint Citation Index.

In this blog post, we discuss how preprints are indexed by different bibliographic databases and we present recommendations for optimizing the indexing of preprints.

Indexing of preprints in bibliographic databases

Our focus is on five popular bibliographic databases: Dimensions, Europe PMC, the Lens, Scopus, and Web of Science. Europe PMC and the Lens are freely accessible. Europe PMC has also adopted the Principles of Open Scholarly Infrastructure. Scopus and Web of Science require a subscription. Dimensions has a freely accessible version with limited functionality. A subscription is needed to access the full version. We take into account only information made available through the web interfaces of the various databases. Some databases may contain additional information that is not accessible through their web interface, but we do not consider such information.

Databases such as Crossref, OpenAlex, and OpenCitations do not have an easy-to-use web interface, making them less interesting for end users. We therefore do not discuss these databases. We do not consider Google Scholar either. While Google Scholar offers an important search engine for scholarly literature, the underlying database is hard to access. We also do not consider PubMed, since its indexing of preprints is still in a pilot phase.

Dimensions and the Lens both index preprints from a large number of preprint servers across all disciplines. In the case of the Lens, it is important to be aware that some preprints, in particular from arXiv and SSRN, incorrectly have not been assigned the publication type ‘preprint’. Europe PMC also covers a large number of preprint servers, but due to its focus on the life sciences, it does not index preprints from servers such as arXiv and SSRN (except for COVID-19 preprints). The recently launched Preprint Citation Index in Web of Science covers five preprint servers: arXiv, bioRxiv, ChemRxiv, medRxiv, and Preprints.org. A number of large preprint servers, including OSF Preprints, Research Square, and SSRN, are not (yet) covered by Web of Science. Scopus covers arXiv, bioRxiv, ChemRxiv, medRxiv, Research Square, SSRN, and TechRxiv. It does not (yet) index preprints from servers such as OSF Preprints and Preprints.org. Moreover, indexing of preprints in Scopus has two significant limitations: Preprints published before 2017 are not indexed, and preprints are not included in the document search feature in Scopus. Preprints are included only in the author profiles that can be found using the author search feature.

Challenges of indexing preprints

Indexing of preprints raises a number of challenges, which are addressed in different ways by the different bibliographic databases. We consider four challenges: version history, links to journal articles, links to peer reviews, and citation links.

A preprint may have multiple versions. Ideally a bibliographic database should present a version history. This enables users to see when the first version of a preprint was published and when the preprint was last updated. Europe PMC and Web of Science do indeed present version histories. The other databases do not provide this information.

Many articles are first published on a preprint server and then in a journal. Bibliographic databases should show the link between a preprint and the corresponding journal article. Dimensions, Europe PMC, and Web of Science show these links. The Lens and Scopus do not show them.

It is increasingly common for preprints to be peer reviewed outside the traditional journal peer review system. Peer review of preprints may take place on platforms such as Peer Community In, PeerRef, PREreview, and Review Commons. The reviews are typically made openly available. Several research funders have made a commitment to recognize peer-reviewed preprints in the same way as peer-reviewed journal articles. To support these developments, bibliographic databases should index not only preprints but also peer reviews and should link preprints to the corresponding reviews. None of the databases currently offer this feature. Europe PMC provides links from preprints to peer reviews, but these links refer to external platforms. Europe PMC itself does not index peer reviews. The Lens does index peer reviews, but the reviews are not linked to preprints.

Citation links represent another challenge. In its citation statistics, Scopus does not include citations given by preprints. The other databases do include such citations. This raises the issue of duplicate citations. When an article has been published both on a preprint server and in a journal, the preprint version of the article and the journal version will typically have a similar or identical reference list. A publication that is cited by the article may then receive two citations, one from the preprint version of the article and one from the journal version. None of the databases deduplicate such citations.

Recommendations for indexing preprints

In the box below, we present six recommendations for optimizing the indexing of preprints in bibliographic databases. As we will discuss later, implementing these recommendations requires close collaboration between bibliographic databases and other actors in the scholarly publishing system.

💡
Recommendation 1: Cover all relevant preprint servers.
A bibliographic database should index preprints from all relevant preprint servers. A disciplinary database (e.g., PubMed and Europe PMC) should index preprints from all preprint servers relevant in a particular discipline. A multidisciplinary database (e.g., Dimensions, the Lens, Scopus, and Web of Science) should index preprints from all preprint servers across all disciplines.

Recommendation 2: Provide comprehensive preprint metadata.
A bibliographic database should provide metadata for preprints that is as comprehensive as metadata for journal articles. The metadata should at least include the title and abstract of a preprint, the names and affiliations of the authors, the reference list, and funding information. It should also include a version history.

Recommendation 3: Provide links between preprints and journal articles.
If an article has been published both on a preprint server and in a journal, a bibliographic database should provide a link between the preprint and the journal article. The link establishes that the preprint and the journal article are different versions of the same article. The preprint and the journal article belong to the same publication family.

Recommendation 4: Provide links between preprints and peer reviews.
If a preprint has been peer reviewed and the reviews have been made openly available, a bibliographic database should index the reviews and should provide links between the preprint and the reviews.

Recommendation 5: Provide deduplicated citation links between publication families.
A bibliographic database should provide deduplicated citation links at the level of publication families. If there are multiple citation links from publications in one publication family (e.g., from a preprint and from a journal article) to publications in another publication family, these citation links should be deduplicated.

Recommendation 6: Do not make arbitrary distinctions between publication types (preprints, journal articles, and others).
A bibliographic database should not make arbitrary distinctions between preprints, journal articles, and other publication types. A database may inform its users about relevant differences between publications of different types (e.g., whether publications have been peer reviewed or not), but otherwise it should treat all publications in the same way, regardless of their publication type.
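
To make Recommendations 2-4 more concrete, here is a minimal sketch, in Python, of the kind of record a database could hold for a single preprint. The field names and every identifier below are invented for illustration and do not correspond to any real schema, DOI, or database.

# Hypothetical preprint metadata record; all values are made up.
preprint_record = {
    "doi": "10.1101/2023.01.0001",
    "title": "An example preprint",
    "abstract": "One or two paragraphs of abstract text...",
    "authors": [
        {"name": "A. Researcher", "affiliation": "Example University"},
    ],
    "funding": [
        {"funder": "Example Science Foundation", "grant": "ESF-123"},
    ],
    "references": ["10.1234/cited-article-1", "10.1234/cited-article-2"],
    "versions": [
        {"version": 1, "date": "2023-01-10"},  # version history (Recommendation 2)
        {"version": 2, "date": "2023-03-02"},
    ],
    "journal_article": "10.1234/journal-version",  # link to the journal version (Recommendation 3)
    "peer_reviews": ["10.5555/review-0001"],        # links to openly available reviews (Recommendation 4)
}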

The table below summarizes the extent to which different bibliographic databases meet our six recommendations. Two stars are awarded if a database fully meets a recommendation, one star is awarded if a recommendation is partly met, and no stars are awarded if a recommendation is not met at all.

Dimensions, Europe PMC, and the Lens seem to cover all relevant preprint servers, resulting in two stars for the first recommendation. We have awarded one star to Scopus and Web of Science. They still need to work on improving their coverage of preprint servers. Web of Science informed us that in the coming year it expects to substantially increase the number of preprint servers it covers.

None of the bibliographic databases provide comprehensive preprint metadata. The databases make available basic metadata fields such as the title, abstract, and publication date of a preprint as well as the names of the authors. Sometimes they also provide more advanced metadata fields, for instance the reference list, author affiliations, and funding information, but in many cases these metadata fields are missing. Based on some spot checking, our impression is that each database has its own strengths and weaknesses in terms of the completeness of preprint metadata. For instance, while some databases provide more comprehensive metadata for arXiv preprints, others do a better job for bioRxiv preprints. In general, metadata seems to be more complete for recent preprints than for older ones. Since all databases suffer from gaps in their preprint metadata, we have awarded one star to each of them. As we will discuss below, improving the availability of preprint metadata is a joint responsibility of bibliographic databases and preprint servers.

The Lens and Scopus do not provide links between preprints and journal articles, although the Lens told us that they are working on providing such links. Dimensions, Europe PMC, and Web of Science do provide links between preprints and journal articles. However, because the links seem to be incomplete, we have awarded only one star to these databases.

Europe PMC is the only database that provides links from preprints to peer reviews, but it does not index the reviews. We have therefore awarded one star to Europe PMC. We have also awarded one star to the Lens. While the Lens does not provide links between preprints and peer reviews, it does index reviews in its database.

The figure below illustrates the idea of deduplicating citation links. There are three publication families, each consisting of a preprint (in green) and a corresponding journal article (in blue). Before deduplication, there are six citation links. After removing duplicate citation links between publication families, only three citation links are left. Of the five bibliographic databases considered in this blog post, none provides deduplicated citation links between publication families. The Lens informed us that they may provide deduplicated links in the future.
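To make the deduplication step concrete, here is a minimal Python sketch of the idea (our own illustration with hypothetical identifiers, not code from any of the databases): each publication is mapped to its publication family, and citation links are then collapsed to one link per pair of families, reproducing the six-to-three reduction described above.

# Illustrative only: the identifiers below are hypothetical, not real DOIs.
family_of = {
    "preprint:A": "family:A", "article:A": "family:A",
    "preprint:B": "family:B", "article:B": "family:B",
    "preprint:C": "family:C", "article:C": "family:C",
}

# Raw citation links as (citing publication, cited publication) pairs.
# Both versions of A cite B and C, and both versions of B cite C,
# giving six links before deduplication.
raw_links = [
    ("preprint:A", "preprint:B"), ("article:A", "article:B"),
    ("preprint:A", "preprint:C"), ("article:A", "article:C"),
    ("preprint:B", "preprint:C"), ("article:B", "article:C"),
]

def deduplicate(links, family_of):
    """Keep at most one citation link per ordered pair of publication families."""
    return {(family_of[citing], family_of[cited]) for citing, cited in links}

deduped = deduplicate(raw_links, family_of)
print(len(raw_links), "raw links ->", len(deduped), "family-level links")
# 6 raw links -> 3 family-level links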

Dimensions, Europe PMC, and the Lens treat preprints in the same way as journal articles. They therefore fully meet our sixth recommendation, resulting in two stars. Web of Science makes a sharp distinction between preprints and journal articles by including preprints in a separate Preprint Citation Index and by presenting citation information separately for citations from preprints and citations from journal articles. Moreover, citations from journal articles are shown more prominently than citations from preprints, except in the Preprint Citation Index, where citations from preprints are displayed more prominently. This seems inconsistent and may confuse users. Because of this, we have awarded only one star to Web of Science. Scopus has not been awarded any stars. The document search feature in Scopus enables users to search for journal articles, but not for preprints. This is a highly arbitrary distinction between these two publication types. Scopus also excludes citations from preprints from the citation statistics it provides.

The above table shows that there is ample room for bibliographic databases to improve their indexing of preprints. Users of bibliographic databases will increasingly be interested in preprints, in addition to journal articles and other more traditional publication types. We advise users interested in preprints to make sure they use a database that serves their needs, and we hope the above table will help them make the right choice.

Improving indexing of preprints - The need for joint action

Bibliographic databases are part of a larger ecosystem of infrastructures for scholarly publishing. To allow bibliographic databases to improve their indexing of preprints, other actors in this ecosystem also need to take action.

First of all, preprint servers need to make available as much preprint metadata as possible. Most preprint servers register DOIs for preprints at Crossref. This enables them to make preprint metadata available by depositing the metadata to Crossref. Preprint servers are indeed submitting metadata to Crossref, but there is still considerable room for improvement. Metadata submitted to Crossref can be harvested by bibliographic databases, helping them to provide comprehensive metadata for the preprints they index. One database, Scopus, informed us that to obtain high-quality preprint metadata it needs to scrape the metadata from the websites of preprint servers. In a well-organized infrastructure ecosystem, there should be no need for web scraping.
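As an illustration of what such harvesting can look like, the short sketch below queries the public Crossref REST API for preprint records (registered as "posted-content"). It is a minimal example of our own, not the ingestion pipeline of any particular database, and the mailto address is a placeholder.

import requests

# Query the public Crossref REST API for preprints ("posted-content" records).
# The polite-pool mailto parameter is a placeholder; replace it with your own address.
resp = requests.get(
    "https://api.crossref.org/works",
    params={
        "filter": "type:posted-content",
        "rows": 5,
        "mailto": "database-indexer@example.org",
    },
    timeout=30,
)
resp.raise_for_status()

for item in resp.json()["message"]["items"]:
    doi = item.get("DOI")
    title = (item.get("title") or ["(no title)"])[0]
    authors = [
        " ".join(filter(None, [a.get("given"), a.get("family")]))
        for a in item.get("author", [])
    ]
    # Richer fields (abstract, references, funding) are present only when the
    # preprint server has actually deposited them -- which is exactly the gap
    # discussed above.
    print(doi, "|", title, "|", ", ".join(authors))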

Second, journal publishers need to add links from the articles they publish in their journals to the corresponding preprints. These links can be included in the article metadata that publishers deposit to Crossref. Unfortunately, with a few exceptions (Copernicus, eLife), publishers are doing a poor job in linking journal articles to preprints. Publishers need to work together with the providers of manuscript submission systems (Editorial Manager, Open Journal Systems, ScholarOne, etc.) to improve this.
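Crossref exposes such version links in the relation metadata of a work (for example "is-preprint-of" on the preprint record, or "has-preprint" on the journal article). The sketch below, using a placeholder DOI, shows how a database could pick these links up when they have been deposited; treat it as a rough illustration rather than a complete harvester.

import requests

def preprint_article_links(doi: str) -> list[str]:
    """Return DOIs linked to `doi` as the other version of the same article."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    relation = resp.json()["message"].get("relation", {})
    linked = []
    # "is-preprint-of" appears on preprints, "has-preprint" on journal articles.
    for key in ("is-preprint-of", "has-preprint"):
        for target in relation.get(key, []):
            if target.get("id-type") == "doi":
                linked.append(target["id"])
    return linked

# Example usage (replace the placeholder with a real preprint or article DOI):
# print(preprint_article_links("10.xxxx/placeholder"))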

Third, preprint servers and peer review platforms need to work together to allow bibliographic databases to harvest links between preprints and peer reviews. Ideally, each peer review should have its own DOI, and peer review platforms should include links from peer reviews to the corresponding preprints in the metadata they submit to the DOI registration agency (Crossref, DataCite). For peer reviews without a DOI, other ongoing infrastructure initiatives, in particular the COAR Notify Initiative, DocMaps, and Sciety, offer promising approaches for linking preprints and peer reviews.

Outlook

Without doubt, the adoption of preprinting will continue to increase in the coming years. Preprint peer review will also grow in importance, especially in the life sciences. As a result of the ongoing transition toward a culture of open science, researchers will increasingly share their work in an early stage, prior to peer review. Sharing intermediate results, such as research plans (preregistration), will also become more common. In addition, the boundaries between different types of publications will get increasingly blurry. For instance, eLife, Peer Community In, and the various F1000 platforms essentially represent hybrids of preprinting and traditional journal publishing. Their popularity will make the distinction between preprints and journal articles more and more fuzzy.

To keep up with these developments, bibliographic databases need to innovate. This is the case in particular for Scopus and Web of Science, the two oldest databases. Scopus has a selective coverage policy. The same applies to the Core Collection of Web of Science. Rather than trying to cover as many journals as possible, these databases cover only journals that are considered to meet certain quality standards. This philosophy of selectivity is difficult to maintain in a world in which sharing of non-peer-reviewed research results is not only becoming more accepted, but is even gradually becoming the norm. It seems essential for Scopus and Web of Science to work toward providing a more comprehensive coverage of the scholarly literature. Indexing of preprints is an important step in this direction.

We expect users of bibliographic databases to increasingly move away from the idea that a database should filter the scholarly literature for them by indexing only high-quality content. Users will instead expect a database to offer tools that allow them to filter the literature themselves. This means that bibliographic databases need to provide a comprehensive coverage of the literature and need to help users answer questions such as: Does this publication present the final results of a research project or does it report provisional intermediate findings? And what kinds of quality checks has the publication undergone? Has it been peer reviewed? And if so, are the peer reviews openly available? Or is there other information available about the nature of the peer review process? And have data and code underlying the publication been made openly available? Enabling users to obtain the best possible answers to these types of questions is the new holy grail for bibliographic databases.


We thank Dimensions, Europe PMC, the Lens, Scopus, and Web of Science for their feedback on a draft version of this blog post. We are also grateful to Iratxe Puebla, Martyn Rittman, and several colleagues at the Centre for Science and Technology Studies (CWTS), Leiden University for their feedback.

EDIT [2023-02-21]: Some minor remarks were added regarding the Lens (they're adding preprint links and may provide deduplicated links).

\",\"tags\":[\"Original Research\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/e21jg-1b369\",\"short_id\":\"k4g8m8e5\",\"url\":\"https://upstream.force11.org/interoperable-infrastructure-for-software-and-data-publishing/\",\"title\":\"Interoperable infrastructure for software and data publishing\",\"summary\":\"Research data and software rely heavily on the technical and social infrastructure to disseminate, cultivate, and coordinate projects, priorities, and activities. The groups that have stepped forward to...\",\"date_published\":\"2023-03-28T07:50:17Z\",\"date_modified\":\"2023-03-28T23:57:39Z\",\"authors\":[{\"url\":null,\"name\":\"Stephan Druskat\"},{\"url\":\"https://orcid.org/0000-0001-8420-5254\",\"name\":\"Kristi Holmes\"},{\"url\":null,\"name\":\"Jose Benito Gonzalez Lopez\"},{\"url\":null,\"name\":\"Lars Holm Nielsen\"},{\"url\":null,\"name\":\"Stefano Iacus\"},{\"url\":null,\"name\":\"Adam Shepherd\"},{\"url\":\"https://orcid.org/0000-0002-7378-2408\",\"name\":\"John Chodacki\"},{\"url\":null,\"name\":\"Danie Kinkade\"},{\"url\":null,\"name\":\"Gustavo Durand\"}],\"image\":\"https://upstream.force11.org/content/images/2023/03/pexels-annam-w-1057861.jpg\",\"content_html\":\"

Research data and software rely heavily on technical and social infrastructure to disseminate, cultivate, and coordinate projects, priorities, and activities. The groups that have stepped forward to support these activities are often segmented by aspects of their identity - facets like discipline, for-profit versus academic orientation, and others. Siloes across the data and software publishing communities are further splintered into those driven by altruism and collective advancement versus those motivated by ego and personal or project success. Roadblocks to progress are not limited to commercial interests; they are also created by those who refuse to build on past achievements, to serve the collective good, or to seize opportunities for collaboration, insisting instead on reinventing the wheel and reinforcing siloes.

In the open infrastructure space, several community-led repositories have joined forces to collaborate on single integrations or grant projects (e.g. integrations with Frictionless Data, compliance with Make Data Count best practices, and common approaches to API development). While it is important to openly collaborate to fight against siloed tendencies, many of our systems are still not as interoperable as they could and should be. As a result, our aspirational goals for the community and open science are not being met with the pacing that modern research requires.

In November 2022, members of open, community-led projects that support research data and software came together during an NSF-funded workshop to address the above concerns and outline actionable next steps. As builders and educators that regularly work and collaborate together, we wanted to leverage our trusted relationships to take a step back and critically examine and understand the broader infrastructure, systems, and communities above and around us that influence our success and goals. Through this process, we identified and categorized key problem areas that require investment: lack of standards, false solutions, missing incentives, cultural barriers, the need for training and education, opportunities for investment in new/existing infrastructure, need for support for sensitive data, and lack of guidance/support for leadership in our communities.

From there, we built the first steps of a plan to collectively improve the scalability, interoperability and quality of data and software publishing.

If we did it right, what would it look like?

Make data and software publishing as easy as possible – not just at one repository

Many data and software repositories market their submission processes as seamless. But seamless workflows within a single repository are not enough. Researchers are not committed to a single repository, and data often need to be linked across repositories for successful reuse. Drastically disparate processes hinder us from meeting our goals. Making data and software publishing as easy as possible requires us to look upstream and invest in more accessible and interoperable tooling, as well as in education and training.

Investments in infrastructure should prioritize tools that focus on pipelines from data and software to repositories, while remaining platform agnostic and openly pluggable. Workflows differ across disciplines, but basic command line functions for API submission to repositories should be a baseline requirement.
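As a minimal sketch of what such a scriptable baseline could look like, the example below creates a draft record via an InvenioRDM-style REST API. The instance URL, token, and metadata are placeholders, and real repositories will differ in endpoint details.

import requests

REPOSITORY = "https://repository.example.org"   # placeholder InvenioRDM-style instance
TOKEN = "REPLACE_WITH_PERSONAL_ACCESS_TOKEN"    # placeholder personal access token

draft = {
    "access": {"record": "public", "files": "public"},
    "files": {"enabled": True},
    "metadata": {
        "resource_type": {"id": "dataset"},
        "title": "Example dataset deposited from the command line",
        "creators": [
            {"person_or_org": {"type": "personal",
                               "given_name": "Ada", "family_name": "Lovelace"}}
        ],
    },
}

# Create a draft record; file upload and publication would follow as further calls.
resp = requests.post(
    f"{REPOSITORY}/api/records",
    json=draft,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("Created draft record:", resp.json().get("id"))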

There is also room to develop more educational investments in the form of training materials and experiences, beginning at the high school and undergraduate level or earlier, when students are discovering personal interests and are eager to communicate and share their ideas. Students and non-students alike can nurture their excitement through hands-on exploration with data and code. The intersection of an interesting topic with critical questions, such as what data and software publishing is and how to publish data and software, together with basic computational skills for seamless publishing, can provide opportunities to nurture lifelong discovery and sharing. Groups like The Carpentries are well placed to build out the needed modules and to upskill researchers earlier in their careers.

As we improve the processes for getting data and software into diverse repositories, we can begin to think about what it would look like for our repositories to expose standard sets of disciplinary metadata to allow for automated linkages. This type of work would drastically change search capabilities, required for use and reuse of all research outputs, including data and software.

Build scalable processes for ensuring quality and reusability
Repositories have varying levels of support for assessing the quality of hosted data and software, ranging from curation services and automated validation of files or metadata, to documenting and enforcing best practices. This work should be coordinated across repositories to ensure researchers can easily understand expectations and leverage standards; importantly, these expectations should be built into curricula upstream from the data and software submission processes.

There are emerging standards for this work. They rely on building a clear discipline-specific understanding of the data and software and offer contextual support to create machine-readable information about the files. These standards and approaches must be coordinated and offered at scale across multiple repositories in order for them to be successfully adopted and iterated on at the rate researchers need.

Researchers reusing data should understand the usability and the fit for purpose of each dataset or software package. This cannot be adequately addressed by a mere badging scheme. To properly address this challenge and support trust and reuse, effective interfaces are needed to gauge the level of metadata and quality of data and software up front. Importantly, and unrelated to disciplinary metadata, there must be an emphasis on provenance. Data and software published without provenance information should be flagged as untrustworthy.

Examples of tools for this type of work are available. For software, the HERMES workflow tool can record basic software provenance - e.g. contributorship and authorship, versioning, dependencies - and records the provenance of the software metadata itself during the submission preparation process. For data, leveraging strategies such as data containerization facilitates the use of flexible open source tooling for manipulating and processing data during curation. Frictionless Data’s Data Package Pipelines employs reusable and structured declarative statements to execute routine actions on behalf of the data curator, creating machine-readable package manifests and provenance records while decreasing human error and increasing efficiency in data processing. We know that investments into and adoption of these types of tools are essential to our greater success.
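For example, with the frictionless-py library (one open source option in this space; the file name below is a placeholder), a curator can generate a machine-readable descriptor and a validation report for a tabular file in a few lines. Data Package Pipelines adds a richer, declarative layer on top of this kind of tooling.

from frictionless import describe, validate

# Infer a machine-readable descriptor (schema, dialect, basic provenance) for a file.
resource = describe("observations.csv")          # placeholder file name
resource.to_json("observations.resource.json")   # the descriptor travels with the data

# Run structural and schema checks on the same file and inspect the result.
report = validate("observations.csv")
print("valid:", report.valid)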

Launch a productive community for change
Broad coalitions across research data and software platforms exist and have a place in defining community benefit and approaches. However, they can also stand in the way of action. We need a venue to openly discuss ideas, and we need to trust that collaborators will offer resources openly and productively - not just showing up for attendance’s sake, but investing in building a fertile ecosystem together. While it may be unpopular in some circles, this will mean building an exclusionary space: one where the members have pledged to support the collective benefit over individual reward.

This type of community already exists. We just haven't formalized it. Now is the time to move quickly toward our common goals. This type of space is required for coordination across stakeholders to build clear examples of the ROI of our investments into data and software publishing, and better integrate leadership (across all stakeholders) into the conversation.

Committing resources and intentional work - and not just showing up

Achieving scalable, high-quality, interoperable data and software publishing is possible. There are already builders, some represented by the authorship of this article, who are on the right path, building tools that effectively meet the needs of researchers in an open and pluggable way. One example is InvenioRDM, a flexible and turn-key next-generation research data management repository built by CERN and more than 25 multi-disciplinary partners worldwide; InvenioRDM leverages community standards and supports FAIR practices out of the box. Another example of agnostic, pluggable tooling, in this case for software submission, is the set of submission workflow tools currently developed in the HERMES project. These allow researchers to automate the publication of software artifacts together with rich metadata, to create software publications following the FAIR Principles for Research Software.

Meaningful progress and lasting success require people to do real work and commit real resources. Success requires community-led and community-serving projects across multiple scholarly and research communities to rally behind and support those driving progress in data and software publishing and the adoption of best practices and community standards that enable a bright, interoperable, and function-forward scholarly ecosystem. Success will also depend upon transparency to shine a light on the vanguards leading this journey, as well as on exposing and understanding conflicting motivations and interests that prioritize good PR and drain energy and resources from the community. Success ultimately requires true collaboration, with a mindset of “for the community - long-term” as opposed to “for my project - right now”, and focused action to deliver results and solutions.

The time is now! We are highly committed to this vision and to working together to build the community and technical structures needed to finally advance data and software publishing across research disciplines.


This post was cowritten by all attendees of the NSF-funded workshop:

John Chodacki, California Digital Library
Maria Praetzellis, California Digital Library
Gustavo Durand, Harvard
Jason Williams, Cold Spring Harbor Laboratory
Stefano Iacus, Harvard
Adam Shepherd, BCO-DMO
Danie Kinkade, BCO-DMO
Kristi Holmes, Northwestern/ InvenioRDM
Maria Gould, California Digital Library
Matt Carson, Northwestern
Stephan Druskat, German Aerospace Center (DLR)
Britta Dreyer, DataCite
Jose Benito Gonzalez, Zenodo/CERN
Kristian Garza, DataCite
Steve Diggs, California Digital Library
Lars Holm Nielsen, Zenodo/CERN

\",\"tags\":[\"Meetings\"],\"language\":null},{\"id\":\"https://doi.org/10.54900/btrdtvc-cjjn94g\",\"short_id\":\"31ep8xgn\",\"url\":\"https://upstream.force11.org/scholarly-infra-latam/\",\"title\":\"Scholarly Infrastructure: a Latin American perspective\",\"summary\":\"Infrastructure: what’s at stakeInfrastructure often is perceived as a “given”, as something that was always there, as “natural”. In the digital age, infrastructure seems more “natural” than ever (it's hard to...\",\"date_published\":\"2023-02-07T16:08:41Z\",\"date_modified\":\"2023-02-27T08:34:34Z\",\"authors\":[{\"url\":null,\"name\":\"Carolina Tanigushi\"},{\"url\":\"https://orcid.org/0000-0002-1598-7181\",\"name\":\"Gabi Mejias\"}],\"image\":\"https://images.unsplash.com/photo-1563126171-fa019bbbfa18?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDF8fHNvdXRoJTIwYW1lcmljYXxlbnwwfHx8fDE2NzU3ODY2MjM&ixlib=rb-4.0.3&q=80&w=2000\",\"content_html\":\"

Infrastructure: what’s at stake

Infrastructure is often perceived as a “given”, as something that was always there, as “natural”. In the digital age, infrastructure seems more “natural” than ever (it's hard to imagine there was a time without internet connectivity on our mobile phones, or even a time when phone lines were a luxury item), and the social and economic dimensions of infrastructure tend to be rendered invisible and left out of the discussion.

Another challenge around infrastructure is that, when it functions well, it tends to be forgotten, despite the large amount of resources and commitment invested in its maintenance. The infrastructure of academic knowledge production attracts very little interest in public debate (except during times of crisis!) even though it serves society as a whole. It is this invisibility that puts infrastructure permanently at risk, especially under the neoliberal paradigm.

Credit: rodoluca, Wikimedia Commons. This file is published under the Creative Commons Attribution-Share Alike 3.0 Unported license.

In scholarly communications, knowledge dissemination and infrastructure are complexly intertwined. Over the last 40 years, large commercial publishers have increased their control of scientific output (according to a study by Vincent Larivière, Stefanie Haustein and Philippe Mongeon, ten years ago the five most prolific commercial publishers accounted for more than 50% of all papers published), and this trend has been intensified by the digitization of research.

In the last decade, the biggest commercial publishers have been acquiring infrastructure services, pursuing economic convergence through horizontal integration (when a company acquires competitor companies) or vertical integration (when a company acquires vendors, suppliers or other related companies within the same industry/sector). This behavior has both economic and political consequences.

These commercial publishers have moved from publishing to also owning vendor systems such as submission and journal management systems, repositories, current research information systems (CRIS), faculty information systems (FIS), funder systems, and beyond, in a move that can be called an “appetite for acquisition”. This has several effects and raises many concerns across the research life cycle: growing dependence of researchers and institutions on these vendors and a lack of competition from smaller organizations offering services, resulting in a loss of community involvement and control across knowledge production and its communication.

Alejandro Posada and George Chen have studied and analyzed how Elsevier’s increased influence on infrastructure leads to greater inequality in knowledge production on a global scale, noting that “this is particularly challenging for researchers in the global south whose methodology and epistemological approach does not align with mainstream models of research production and evaluation”.

With the digitization of scholarship and the rise of the open research movement, new models and outputs of science communication have emerged  beyond the journal article. Scholarly communications is shifting towards the “record of versions”, rather than just a One True “version of record”, where persistent identifiers and their metadata enable recognition, linking and discoverability of a wide range of outputs regardless of where those are housed. It is worth noting the importance of infrastructure in connecting all outputs and resources throughout the research lifecycle (such as research data, software, samples, etc.) to better understand and evaluate the contributions to research, and support their recognition. Many of the organizations providing this kind of foundational infrastructure have been established as non profit community governed and sustained initiatives (Crossref, DataCite, ROR), and are committed to the Principles of Open Scholarly Infrastructure.

The Latin American perspective

Over the past few years, the open science movement has emerged with such strength that it is pushing publishers to change their strategies and business models, the decline of the subscription-based model and the transition to the pay-to-publish model (vs the former “pay to read” model) being some of the most notable examples. While many have received initiatives like Plan S as a positive action to accelerate the transition to open access, beyond the Global North some voices have emerged to question these measures as progress.

Do article processing charges (APCs) and transformative agreements promote openness or strengthen the current status quo? Many then have turned “South” and started looking at “alternative” models that aren’t dependent on commercial publishers and for-profit service providers.

ESTE DÍA QUEDARÁ EN LA HISTORIA DEL FÚTBOL MUNDIAL. \U0001F1E6\U0001F1F7❤️ [This day will go down in the history of world football.] pic.twitter.com/CrwtkaUOVh

— Ataque Futbolero (@AtaqueFutbolero) December 20, 2022
The importance of infrastructure, graphical description

In contrast to what happens in the Global North, in Latin America open access is not an “alternative” model but has long been mainstream. The region has a long history of publicly funded Open Access without APCs. There are a few reasons for this, one of which is the high cost of journal subscriptions in the 90s, which was a major motivator for the creation of electronic journals that are free to read and to publish in.

Despite this Latin American open access tradition, in countries like Colombia APC payments are increasing. Many voices in the community argue that transformative agreements might threaten the current local ecosystem: the more funds are allocated to APCs, the less is invested in shared infrastructure and tools.

When it comes to infrastructure, just being open might not be enough; operating infrastructure is not simple and requires investment, capacity building, maintenance, and dedicated, committed staff who can ensure accessible, inclusive, and responsive tools. Resilience and sustainability are very sizable challenges that need to be addressed via governance.

This is part of the reason why using infrastructure from private vendors can be sometimes appealing. There's an ease to it and of course there’s a price to this, both in the literal and in the surveillance-capitalism sense.

A local solution in Latin America has been to build independent and self-sustainable organizations, such as SciELO (established in 1998 and celebrating its 25th anniversary with an international conference this year), Redalyc (2002), and La Referencia (2010), that not only publish research results but also offer infrastructure and innovation options, as well as training to improve research publishing and dissemination practices, allowing local communities to operate according to the state of the art in scholarly communications.

These types of initiatives are usually built with government funding, whether directly or indirectly. In the end, one of their main contributions is to enable scaling, ensuring resilience and independence from economically exploitative practices; more importantly, they put local communities and networks at the center and give them control of the knowledge they produce.

Ultimately, open goes beyond access, and it is indispensable for our community to question and rethink the ownership and diversity of research infrastructure. There is an urgent need to reclaim scholarly infrastructure if we want to pursue the benefit of the majority instead of the profit of a few. There are many ways to play a more proactive role in steering research infrastructure: (choosing and) using open community-led infrastructure and services, through institutional membership, sharing use cases and feedback for improvement, participating in governance and working groups, and more.

Within this framework we want to introduce a series of interviews to showcase Latin American actors driving non-commercial community-driven infrastructure for the region. Stay tuned!


DISCLAIMER: the authors of this interview series work at DataCite and SciELO, respectively; the opinions expressed in this post are their own and don’t necessarily represent those of their employers.

\",\"tags\":[\"Thought Pieces\"],\"language\":\"en\"},{\"id\":\"https://doi.org/10.54900/6p6re-xyj61\",\"short_id\":\"qlgxqwem\",\"url\":\"https://upstream.force11.org/an-initial-scholarly-ai-taxonomy/\",\"title\":\"An Initial Scholarly AI Taxonomy\",\"summary\":\"Although advances in artificial intelligence (AI)1 have been unfolding for over decades, the progress in the last six months has come faster than anyone expected. The public release of ChatGPT in November...\",\"date_published\":\"2023-04-11T08:00:34Z\",\"date_modified\":\"2023-04-11T15:29:38Z\",\"authors\":[{\"url\":null,\"name\":\"Adam Hyde\"},{\"url\":\"https://orcid.org/0000-0002-7378-2408\",\"name\":\"John Chodacki\"},{\"url\":null,\"name\":\"Paul Shannon\"}],\"image\":\"https://upstream.force11.org/content/images/2023/04/1-1.png\",\"content_html\":\"

Although advances in artificial intelligence (AI)1 have been unfolding for decades, the progress in the last six months has come faster than anyone expected. The public release of ChatGPT in November 2022, in particular, has opened up new possibilities and heightened awareness of AI's potential role in various aspects of our work and life.

It follows that in the context of the publishing industry, AI also holds the promise of transforming multiple facets of the publishing process2. In this blog post, we begin the development of a rough taxonomy for understanding how and where AI can and/or should play a role in a publisher’s workflow.

We intend to iterate on this taxonomy (for now, we will use the working title ‘Scholarly AI Taxonomy’).

Scholarly AI Taxonomy

To kickstart discussions on AI's potential impact on publishing workflows, we present our initial categorization of the \\\"Scholarly AI Taxonomy.\\\" This taxonomy outlines seven key roles that AI could potentially play in a scholarly publishing workflow:

  1. Extract: Identify and isolate specific entities or data points within the content.
  2. Validate: Verify the accuracy and reliability of the information.
  3. Generate: Produce new content or ideas, such as text or images.
  4. Analyse: Examine patterns, relationships, or trends within the information.
  5. Reformat: Modify and adjust information to fit specific formats or presentation styles.
  6. Discover: Search for and locate relevant information or connections.
  7. Translate: Convert information from one language or form to another.

The above is a first pass at a taxonomy. To flesh out these categories, we have provided examples that illustrate each of them.

We thoroughly recognise that some of the examples below, when further examined, may be miscategorized. Further, we recognise that some examples could be illustrations of several of these categories at play at once and don’t sit easily within just one of the items listed. We also acknowledge that the categories themselves will need thorough discussion and revision going forward. However, we hope that this initial taxonomy can play a role in helping the community understand what AI could mean for publishing processes.

Also note, in the examples we are not making any assertions about the accuracy of AI when performing these tasks. There are a lot of discussions already on whether the current state of AI tools can do the following activities well. We are not debating that aspect of the community discussion; that is for publishers and technologists to explore further as the technology progresses and as we all gain experience using these tools.

These categories are only proposed as a way of understanding the types of contributions AI tools can make. That being said, some of the below examples are more provocative than others in an attempt to help the reader examine what they think and feel about these possibilities.
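Purely as an illustration of how these categories might be referenced in tooling or in discussion (this is our own sketch, not part of any existing system), they could be encoded as a simple data structure and used to tag individual workflow steps:

from enum import Enum

class ScholarlyAIRole(Enum):
    """The seven roles of the working Scholarly AI Taxonomy."""
    EXTRACT = "Identify and isolate specific entities or data points within the content"
    VALIDATE = "Verify the accuracy and reliability of the information"
    GENERATE = "Produce new content or ideas, such as text or images"
    ANALYSE = "Examine patterns, relationships, or trends within the information"
    REFORMAT = "Modify and adjust information to fit specific formats or presentation styles"
    DISCOVER = "Search for and locate relevant information or connections"
    TRANSLATE = "Convert information from one language or form to another"

# Example: tag a hypothetical workflow step with the role AI plays in it.
step = {"task": "pre-fill author affiliations on submission",
        "ai_role": ScholarlyAIRole.EXTRACT}
print(step["ai_role"].name, "-", step["ai_role"].value)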

Initial categorization

Our initial seven categories are detailed further below.

1. Extract - Identify and isolate specific entities or data points within the content

In the extraction stage, AI-powered tools can significantly streamline the process of identifying and extracting relevant information from content and datasets. However, an over-reliance on AI for this task can lead to errors if the models are not well-tuned or lack the necessary context to identify entities accurately. Some speculative examples:

  1. Identifying author names and affiliations from a submitted manuscript to pre-fill forms and save time during submission while increasing the accuracy of the input.
  2. Extracting key terms and phrases for indexing purposes.
  3. Isolating figures and tables from a research article for separate processing.
  4. Extracting metadata, such as title, abstract, and keywords, from a document.
  5. Identifying citations within a text for reference management.

2. Validate - Verify the accuracy and reliability of the information

AI-based systems can validate information by cross-referencing data against reliable sources or expected structures, ensuring content conformity, accuracy and/or credibility. While this can reduce human error, it is essential to maintain a level of human oversight, as AI models may not always detect nuances in language or identify reliable sources. Some examples:

  1. Cross-referencing citations to ensure accuracy and proper formatting.
  2. Verifying author affiliations against an established database.
  3. Ensuring proper image attribution and permissions.
  4. Checking factual information in an article against trusted sources.
  5. Validating claims made in a scientific paper against previous studies.

3. Generate - Produce new content or ideas, such as text or images

AI can create high-quality text and images, saving time and effort for authors and editors. However, the content generated by AI may contain factual inaccuracies, lack creativity, or inadvertently reproduce biases present in the training data, necessitating human intervention to ensure accuracy, quality, originality, and adherence to ethical guidelines. Some examples:

  1. Generating social media content (e.g., summarising longer text to a tweetable length) or promotional content for a new publication.
  2. Creating keyword lists for search engine optimization (SEO).
  3. Automatically generating an abstract or summary of a manuscript, particularly a plain language summary pitched at a certain audience.
  4. Creating a list of suggested article titles based on the content and target audience.
  5. Producing visually engaging charts or graphs from raw data.

4. Analyse - Examine patterns, relationships, or trends within the information

AI-driven data analytics tools can help publishers extract valuable insights from their content, identifying patterns and trends to optimize content strategy. While AI can provide essential information, over-reliance on AI analytics may lead to overlooking important context or misinterpreting data, requiring human analysts to interpret findings accurately. Some examples:

  1. Analysing an image to create accessible text descriptions.
  2. Determining the sentiment of reviews.
  3. Identifying trending topics in a specific field to guide editorial direction.
  4. Analyzing the readability level of a manuscript.
  5. Discovering patterns in citation networks to identify influential articles and authors.

5. Reformat - Modify and adjust information to fit specific formats or presentation styles

AI can reformat content for specific media channels or alternative structures, enhancing user experience and accessibility. However, AI-generated formatting may not always be ideal or adhere to specific style guidelines, requiring human editors to fine-tune the formatting. Some examples:

  1. Formatting content to comply with a specific style guide.
  2. Adapting a long-form article for a shorter, mobile-friendly version.
  3. Converting a manuscript into XML or converting datasets to open formats.
  4. Rearranging content to fit different print and digital formats.
  5. Adjusting images and graphics for optimal display across various devices.

6. Discover - Search for and locate relevant information or connections

AI can efficiently find and link information about a subject, streamlining the research process. However, AI-driven information discovery may yield irrelevant, incorrect, or outdated results, necessitating human verification and filtering to ensure accuracy and usefulness. Some examples:

  1. Finding relevant articles within a publisher’s corpus to recommend for further reading.
  2. Identifying potential reviewers for a submitted manuscript based on their expertise.
  3. Discovering trending topics for a call for papers.
  4. Locating similar works to provide context for a piece of content.
  5. Searching for related images or multimedia to accompany a text.

7. Translate - Convert information from one language or form to another

AI can quickly translate languages and sentiments, making content more accessible and understandable to diverse audiences. However, AI translations can sometimes be inaccurate or lose nuances in meaning, especially when dealing with idiomatic expressions or cultural context, necessitating the involvement of human translators for sensitive or complex content. Some examples:

  1. Translating a research article or book into another language.
  2. Converting scientific jargon into more accessible language for a popular science article.
  3. Adapting a text's cultural references to be more understandable for a global readership.
  4. Translating the sentiment of a text.
  5. Converting spoken language into written transcripts (or vice versa) for interviews or podcasts.

Balancing AI and Human Intervention in Publishing Workflows

There is potential for AI to benefit publishing workflows. Still, it's crucial to identify where AI should play a role and when human intervention is required to check and validate outcomes of assisted technology. In many ways, this is no different to how publishing works today. If there is one thing publishers do well, and sometimes to exaggerated fidelity, it is quality assurance.

However, AI tools offer several new dimensions that can bring machine assistance into many more parts of the process at a much larger scale. This, together with the feeling we have that AI is, in fact, in some ways ‘doing work previously considered to be the sole realm of the sentient’ and the need for people and AI machines to ‘learn together’ so those outcomes can improve, means there are both factual and emotional requirements to scope, monitor, and check these outcomes.

Consequently, workflow platforms must be designed with interfaces allowing seamless ‘Human QA’ at appropriate points in the process. These interfaces should enable publishers to review, edit, and approve AI-generated content or insights, ensuring that the final product meets the required standards and ethical guidelines. Where possible, the ‘Human QA’ should feed back into the AI processes to improve future outcomes; this also needs to be considered by tool builders.

To accommodate this 'Human QA', new types of interfaces will need to be developed in publishing tools. These interfaces should facilitate easy interaction between human users and AI-generated content, allowing for necessary reviews and modifications. For instance, a journal workflow platform might offer a feature where users are asked to 'greenlight' a pre-selected option from a drop-down menu (e.g., institutional affiliation), generated by AI. This way, researchers and editors can quickly validate AI-generated suggestions while providing feedback to improve the AI's performance over time. Integrating such interfaces not only ensures that the content adheres to the desired quality standards and ethical principles but also expedites the publishing process, making it more efficient.
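As a purely hypothetical sketch of such a checkpoint (the field names and workflow are illustrative, not taken from any existing platform), each AI suggestion could be represented as a record that a human reviewer must explicitly approve, with the decision captured as feedback for future tuning:

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class AISuggestion:
    """One AI-generated suggestion awaiting human review (illustrative only)."""
    field_name: str              # e.g. "institutional affiliation"
    suggested_value: str
    approved: Optional[bool] = None
    reviewer_note: str = ""
    decided_at: Optional[datetime] = None

    def greenlight(self, approved: bool, note: str = "") -> None:
        """Record the human decision; this signal can be fed back to the AI system."""
        self.approved = approved
        self.reviewer_note = note
        self.decided_at = datetime.now(timezone.utc)

suggestion = AISuggestion("institutional affiliation", "Example University")
suggestion.greenlight(True, note="matches the author's ORCID record")
print(suggestion)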

The Speed of Trust

Trust plays a large role in this process. As we learn more about the fidelity and accuracy of these systems and confront what AI processes can and can’t do well to date, we will need to move forward with building AI into workflows 'at the speed of trust.'

Adopting a \\\"speed of trust\\\" approach means being cautious yet open to AI's potential in transforming publishing workflows. It involves engaging in honest conversations about AI's capabilities and addressing concerns, all while striking a balance between innovation and desirable community standards. As we navigate this delicate balance, we create an environment where AI technology can grow and adapt to better serve the publishing community.

For example, as a start, when integrating AI into publishing workflows, we believe it is essential to provide an ‘opt-in’ and transparent approach to AI contributions. Publishers and authors should be informed about the extent of AI involvement and its limitations, and presented with interfaces allowing them to make informed decisions about when and how AI will be used. This transparent ‘opt-in’ approach helps build trust, allows us to iterate forward as we gain more experience, and sets the stage for discussions and practices regarding ethical AI integration in publishing workflows.

Conclusion

The potential of AI in publishing workflows is immense, and we find ourselves at a time when the technology has taken a significant step forward. But it's essential to approach its integration with a balanced perspective. We can harness the power of AI while adhering to ethical standards and delivering high-quality content by considering both the benefits and drawbacks of AI, identifying areas for human intervention, maintaining transparency, and evolving our understanding of AI contributions.

The initial taxonomy outlined in this article can serve as a starting point for understanding how AI can contribute to publishing workflows. By categorizing AI contributions in this way, we can also discuss the ethical boundaries of AI-assisted workflows more clearly and help publishers make informed decisions about AI integration.

By adopting a thoughtful strategy, the combined strengths of AI and human expertise can drive significant advancements and innovation within the publishing industry.


1 It's worth noting that we use the term AI here, but we are actually referring to large language models (LLMs); AI serves as useful shorthand since it's the common term used in our community. As we all gain more experience, being more accurate about how we use terms like AI and LLM will become increasingly important. A Large Language Model (LLM) can be described as a sophisticated text processor. It's an advanced machine learning model designed to process, generate, and understand natural language text.

2 By publishing, we are referring to both traditional journal-focused publishing models as well as emergent publishing models such as preprints, protocols/methods, micropubs, data, etc.


Many thanks to Ben Whitmore, Ryan Dix-Peek, and Nokome Bentley for the discussions that led to this taxonomy at our recent Coko Summit. This article was written with the assistance of GPT4.

\",\"tags\":[\"Thought Pieces\"],\"language\":\"en\"}]}" recorded_at: Sun, 04 Jun 2023 13:52:01 GMT recorded_with: VCR 6.1.0