Inside the AI Prompts DOGE Used to “Munch” Contracts Related to Veterans’ Health

When an AI script written by a Department of Government Efficiency employee came across a contract for internet service, it flagged it as cancelable. Not because it was waste, fraud or abuse — the Department of Veterans Affairs needs internet connectivity after all — but because the model was given unclear and conflicting instructions.

Sahil Lavingia, who wrote the code, told it to cancel, or in his words “munch,” anything that wasn’t “directly supporting patient care.” Unfortunately, neither Lavingia nor the model had the knowledge required to make such determinations.

Sahil Lavingia at his office in Brooklyn (Ben Sklar for ProPublica)

“I think that mistakes were made,” said Lavingia, who worked at DOGE for nearly two months, in an interview with ProPublica. “I’m sure mistakes were made. Mistakes are always made.”

It turns out, a lot of mistakes were made as DOGE and the VA rushed to implement President Donald Trump’s February executive order mandating all of the VA’s contracts be reviewed within 30 days.

ProPublica obtained the code and prompts — the instructions given to the AI model — used to review the contracts and interviewed Lavingia and experts in both AI and government procurement. We are publishing an analysis of those prompts to help the public understand how this technology is being deployed in the federal government.

The experts found numerous and troubling flaws: the code relied on older, general-purpose models not suited for the task; the model hallucinated contract amounts, deciding around 1,100 of the agreements were each worth $34 million when they were sometimes worth thousands; and the AI did not analyze the entire text of contracts. Most experts said that, in addition to the technical issues, using off-the-shelf AI models for the task — with little context on how the VA works — should have been a nonstarter.

Lavingia, a software engineer enlisted by DOGE, acknowledged there were flaws in what he created and blamed, in part, a lack of time and proper tools. He also stressed that he knew his list of what he called “MUNCHABLE” contracts would be vetted by others before a final decision was made.

Portions of the prompt are pasted below along with commentary from experts we interviewed. Lavingia published a complete version of it on his personal GitHub account.

Problems with how the model was constructed can be detected from the very opening lines of code, where the DOGE employee instructs the model how to behave:

You are an AI assistant that analyzes government contracts. Always provide comprehensive few-sentence descriptions that explain WHO the contract is with, WHAT specific services/products are provided, and WHO benefits from these services. Remember that contracts for EMR systems and healthcare IT infrastructure directly supporting patient care should be classified as NOT munchable. Contracts related to diversity, equity, and inclusion (DEI) initiatives or services that could be easily handled by in-house W2 employees should be classified as MUNCHABLE. Consider 'soft services' like healthcare technology management, data management, administrative consulting, portfolio management, case management, and product catalog management as MUNCHABLE. For contract modifications, mark the munchable status as 'N/A'. For IDIQ contracts, be more aggressive about termination unless they are for core medical services or benefits processing.

This part of the prompt, known as a system prompt, is intended to shape the overall behavior of the large language model, or LLM, the technology behind AI bots like ChatGPT. In this case, it was used before both steps of the process: first, before Lavingia used it to obtain information like contract amounts; then, before determining if a contract should be canceled.
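
In code, that setup looks roughly like the sketch below, which uses OpenAI's Python client. This is our illustration, not Lavingia's published script; the model name and the helper function are placeholders.

```python
# Minimal sketch (not the actual DOGE script): the same system prompt is sent
# ahead of every request, for both the extraction pass and the "munchable" pass.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are an AI assistant that analyzes government contracts. ..."

def ask_model(user_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the script reportedly used an older model the VA already had
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # shapes the model's overall behavior
            {"role": "user", "content": user_prompt},      # the per-contract request
        ],
    )
    return response.choices[0].message.content
```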

Including information not related to the task at hand can confuse AI. At this point, it’s only being asked to gather information from the text of the contract. Everything related to “munchable status,” “soft-services” or “DEI” is irrelevant. Experts told ProPublica that trying to fix issues by adding more instructions can actually have the opposite effect — especially if they’re irrelevant.

Analyze the following contract text and extract the basic information below. If you can't find specific information, write "Not found".

CONTRACT TEXT: {text[:10000]} # Using first 10000 chars to stay within token limits

The models were only shown the first 10,000 characters from each document, or approximately 2,500 words. Experts were confused by this, noting that OpenAI models support inputs over 50 times that size. Lavingia said that he had to use an older AI model that the VA had already signed a contract for.
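
The gap the experts pointed to is easy to quantify with the common rule of thumb of roughly four characters per token; the 128,000-token context window below describes current large OpenAI models and is our back-of-the-envelope illustration, not a figure from the script.

```python
# Back-of-the-envelope arithmetic (assumes roughly 4 characters per token).
chars_sent = 10_000
approx_tokens_sent = chars_sent // 4        # about 2,500 tokens reached the model
modern_context_window = 128_000             # tokens accepted by current large OpenAI models
print(approx_tokens_sent)                           # 2500
print(modern_context_window // approx_tokens_sent)  # 51, i.e. over 50 times what was sent
```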

Please extract the following information:
1. Contract Number/PIID
2. Parent Contract Number (if this is a child contract)
3. Contract Description - IMPORTANT: Provide a DETAILED 1-2 sentence description that clearly explains what the contract is for. Include WHO the vendor is, WHAT specific products or services they provide, and WHO the end recipients or beneficiaries are. For example, instead of "Custom powered wheelchair", write "Contract with XYZ Medical Equipment Provider to supply custom-powered wheelchairs and related maintenance services to veteran patients at VA medical centers."
4. Vendor Name
5. Total Contract Value (in USD)
6. FY 25 Value (in USD)
7. Remaining Obligations (in USD)
8. Contracting Officer Name
9. Is this an IDIQ contract? (true/false)
10. Is this a modification? (true/false)

This portion of the prompt instructs the AI to extract the contract number and other key details of a contract, such as the “total contract value.”

This was error-prone and not necessary, as accurate contract information can already be found in publicly available databases like USASpending. In some cases, this led to the AI system being given an outdated version of a contract, which led to it reporting a misleadingly large contract amount. In other cases, the model mistakenly pulled an irrelevant number from the page instead of the contract value.

“They are looking for information where it’s easy to get, rather than where it’s correct,” said Waldo Jaquith, a former Obama appointee who oversaw IT contracting at the Treasury Department. “This is the lazy approach to gathering the information that they want. It’s faster, but it’s less accurate.”

Lavingia acknowledged that this approach led to errors but said that those errors were later corrected by VA staff.
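
For contract values specifically, accurate figures can be pulled from public sources like USASpending rather than read off the first pages of a scanned document. The sketch below is our illustration of such a lookup; the endpoint, filters and field names are assumptions based on USAspending's public search API, not part of the DOGE script, and should be checked against the current documentation.

```python
# Hedged sketch: look up a contract's obligated amount on USAspending by PIID.
# The endpoint and field names are assumptions, not part of the DOGE script.
import requests

def lookup_award(piid: str) -> list[dict]:
    payload = {
        "filters": {
            "award_type_codes": ["A", "B", "C", "D"],  # contract award types
            "keywords": [piid],
        },
        "fields": ["Award ID", "Recipient Name", "Award Amount"],
        "limit": 5,
    }
    response = requests.post(
        "https://api.usaspending.gov/api/v2/search/spending_by_award/",
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["results"]
```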

Once the program extracted this information, it ran a second pass to determine if the contract was “munchable.”

Based on the following contract information, determine if this contract is "munchable" based on these criteria:

CONTRACT INFORMATION: {text[:10000]} # Using first 10000 chars to stay within token limits

Again, only the first 10,000 characters were shown to the model. As a result, the munchable determination was based purely on the first few pages of the contract document.
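
Chained together, the two passes might look roughly like the sketch below, reusing the ask_model helper from the earlier snippet; the prompt constants and the contracts list are placeholders we introduced, not names from the published code.

```python
# Sketch of the two-pass flow (placeholders, not the published code).
EXTRACTION_PROMPT = "Please extract the following information: ..."
MUNCHABLE_PROMPT = 'Based on the following contract information, determine if this contract is "munchable" ...'

contracts = [{"ocr_text": "..."}]  # one dict per contract document

for contract in contracts:
    snippet = contract["ocr_text"][:10000]  # only the first 10,000 characters reach the model
    contract["extracted_info"] = ask_model(f"{EXTRACTION_PROMPT}\n\nCONTRACT TEXT: {snippet}")
    contract["munchable_verdict"] = ask_model(f"{MUNCHABLE_PROMPT}\n\nCONTRACT INFORMATION: {snippet}")
```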

Then, evaluate if this contract is "munchable" based on these criteria:
- If this is a contract modification, mark it as "N/A" for munchable status
- If this is an IDIQ contract:
  * For medical devices/equipment: NOT MUNCHABLE
  * For recruiting/staffing: MUNCHABLE
  * For other services: Consider termination if not core medical/benefits
- Level 0: Direct patient care (e.g., bedside nurse) - NOT MUNCHABLE
- Level 1: Necessary consultants that can't be insourced - NOT MUNCHABLE

The above prompt section is the first set of instructions telling the AI how to flag contracts. The prompt provides little explanation of what it’s looking for, failing to define what qualifies as “core medical/benefits” and lacking information about what a “necessary consultant” is.

For the types of models the DOGE analysis used, including all the necessary information to make an accurate determination is critical.

Cary Coglianese, a University of Pennsylvania professor who studies the governmental use of artificial intelligence, said that knowing which jobs could be done in-house “calls for a very sophisticated understanding of medical care, of institutional management, of availability of human resources” that the model does not have.

- Contracts related to "diversity, equity, and inclusion" (DEI) initiatives - MUNCHABLE

The prompt above tries to implement a fundamental policy of the Trump administration: killing all DEI programs. But the prompt fails to include a definition of what DEI is, leaving the model to decide.

Despite the instruction to cancel DEI-related contracts, very few were flagged for this reason. Procurement experts noted that it’s very unlikely for information like this to be found in the first few pages of a contract.

- Level 2+: Multiple layers removed from veterans care - MUNCHABLE
- Services that could easily be replaced by in-house W2 employees - MUNCHABLE

These two lines — which experts say were poorly defined — carried the most weight in the DOGE analysis. The response from the AI frequently cited these reasons as the justification for munchability. Nearly every justification included a form of the phrase “direct patient care,” and in a third of cases the model flagged contracts because it stated the services could be handled in-house.

The poorly defined requirements led to several contracts for VA office internet services being flagged for cancellation. In one justification, the model had this to say:

The contract provides data services for internet connectivity, which is an IT infrastructure service that is multiple layers removed from direct clinical patient care and could likely be performed in-house, making it classified as munchable.

IMPORTANT EXCEPTIONS - These are NOT MUNCHABLE:
- Third-party financial audits and compliance reviews
- Medical equipment audits and certifications (e.g., MRI, CT scan, nuclear medicine equipment)
- Nuclear physics and radiation safety audits for medical equipment
- Medical device safety and compliance audits
- Healthcare facility accreditation reviews
- Clinical trial audits and monitoring
- Medical billing and coding compliance audits
- Healthcare fraud and abuse investigations
- Medical records privacy and security audits
- Healthcare quality assurance reviews
- Community Living Center (CLC) surveys and inspections
- State Veterans Home surveys and inspections
- Long-term care facility quality surveys
- Nursing home resident safety and care quality reviews
- Assisted living facility compliance surveys
- Veteran housing quality and safety inspections
- Residential care facility accreditation reviews

Despite these instructions, AI flagged many audit- and compliance-related contracts as “munchable,” labeling them as “soft services.”

In one case, the model even acknowledged the importance of compliance while flagging a contract for cancellation, stating: “Although essential to ensuring accurate medical records and billing, these services are an administrative support function (a ‘soft service’) rather than direct patient care.”

Key considerations:
- Direct patient care involves: physical examinations, medical procedures, medication administration
- Distinguish between medical/clinical and psychosocial support

Shobita Parthasarathy, professor of public policy and director of the Science, Technology, and Public Policy Program at the University of Michigan, told ProPublica that this piece of the prompt was notable in that it instructs the model to “distinguish” between the two types of services without telling it what to save and what to kill.

The emphasis on “direct patient care” is reflected in how often the AI cited it in its recommendations, even when the model did not have any information about a contract. In one instance where it labeled every field “not found,” it still decided the contract was munchable. It gave this reason:

Without evidence that it involves essential medical procedures or direct clinical support, and assuming the contract is for administrative or related support services, it meets the criteria for being classified as munchable.

In reality, this contract was for the preventative maintenance of important safety devices known as ceiling lifts at VA medical centers, including three sites in Maryland. The contract itself stated:

Ceiling Lifts are used by employees to reposition patients during their care. They are critical safety devices for employees and patients, and must be maintained and inspected appropriately.
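
A basic guardrail against that kind of no-information guess, shown below purely as our illustration and reusing the placeholders from the earlier sketches, would be to route contracts with empty extractions to a human instead of the model:

```python
# Illustrative guardrail, not part of the published script: if the first pass
# found nothing, don't let the second pass guess.
def all_fields_missing(extracted: dict[str, str]) -> bool:
    """True when every extracted field came back as 'Not found'."""
    return all(value.strip().lower() == "not found" for value in extracted.values())

def decide(extracted: dict[str, str], snippet: str) -> str:
    if all_fields_missing(extracted):
        return "NEEDS HUMAN REVIEW"  # no basis for a munchable determination
    return ask_model(f"{MUNCHABLE_PROMPT}\n\nCONTRACT INFORMATION: {snippet}")
```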

Specific services that should be classified as MUNCHABLE (these are "soft services" or consulting-type services):
- Healthcare technology management (HTM) services
- Data Commons Software as a Service (SaaS)
- Administrative management and consulting services
- Data management and analytics services
- Product catalog or listing management
- Planning and transition support services
- Portfolio management services
- Operational management review
- Technology guides and alerts services
- Case management administrative services
- Case abstracts, casefinding, follow-up services
- Enterprise-level portfolio management
- Support for specific initiatives (like PACT Act)
- Administrative updates to product information
- Research data management platforms or repositories
- Drug/pharmaceutical lifecycle management and pricing analysis
- Backup Contracting Officer's Representatives (CORs) or administrative oversight roles
- Modernization and renovation extensions not directly tied to patient care
- DEI (Diversity, Equity, Inclusion) initiatives
- Climate & Sustainability programs
- Consulting & Research Services
- Non-Performing/Non-Essential Contracts
- Recruitment Services

This portion of the prompt attempts to define “soft services.” It uses many highly specific examples but also throws in vague categories without definitions like “non-performing/non-essential contracts.”

Experts said that in order for a model to properly determine this, it would need to be given information about the essential activities and what’s required to support them.

Important clarifications based on past analysis errors:
2. Lifecycle management of drugs/pharmaceuticals IS MUNCHABLE (different from direct supply)
3. Backup administrative roles (like alternate CORs) ARE MUNCHABLE as they create duplicative work
4. Contract extensions for renovations/modernization ARE MUNCHABLE unless directly tied to patient care

This section of the prompt was the result of analysis by Lavingia and other DOGE staff, Lavingia explained. “This is probably from a session where I ran a prior version of the script that most likely a DOGE person was like, ‘It’s not being aggressive enough.’ I don’t know why it starts with a 2. I guess I disagreed with one of them, and so we only put 2, 3 and 4 here.”

Notably, our review found that the only clarifications related to past errors were related to scenarios where the model wasn’t flagging enough contracts for cancellation.

Direct patient care that is NOT MUNCHABLE includes:
- Conducting physical examinations
- Administering medications and treatments
- Performing medical procedures and interventions
- Monitoring and assessing patient responses
- Supply of actual medical products (pharmaceuticals, medical equipment)
- Maintenance of critical medical equipment
- Custom medical devices (wheelchairs, prosthetics)
- Essential therapeutic services with proven efficacy

For maintenance contracts, consider whether pricing appears reasonable. If maintenance costs seem excessive, flag them as potentially over-priced despite being necessary.

This section of the prompt provides the most detail about what constitutes “direct patient care.” While it covers many aspects of care, it still leaves a lot of ambiguity and forces the model to make its own judgments about what constitutes “proven efficacy” and “critical” medical equipment.

In addition to the limited information given on what constitutes direct patient care, there is no information about how to determine if a price is “reasonable,” especially since the LLM only sees the first few pages of the document. The models lack knowledge about what’s normal for government contracts.

“I just do not understand how it would be possible. This is hard for a human to figure out,” Jaquith said about whether AI could accurately determine if a contract was reasonably priced. “I don’t see any way that an LLM could know this without a lot of really specialized training.”

Services that can be easily insourced (MUNCHABLE):
- Video production and multimedia services
- Customer support/call centers
- PowerPoint/presentation creation
- Recruiting and outreach services
- Public affairs and communications
- Administrative support
- Basic IT support (non-specialized)
- Content creation and writing
- Training services (non-specialized)
- Event planning and coordination

This section explicitly lists which tasks could be “easily insourced” by VA staff, and more than 500 different contracts were flagged as “munchable” for this reason.

“A larger issue with all of this is there seems to be an assumption here that contracts are almost inherently wasteful,” Coglianese said when shown this section of the prompt. “Other services, like the kinds that are here, are cheaper to contract for. In fact, these are exactly the sorts of things that we would not want to treat as ‘munchable.’” He went on to explain that insourcing some of these tasks could also “siphon human sources away from direct primary patient care.”

In an interview, Lavingia acknowledged some of these jobs might be better handled externally. “We don’t want to cut the ones that would make the VA less efficient or cause us to hire a bunch of people in-house,” Lavingia explained. “Which currently they can’t do because there’s a hiring freeze.”

The VA is standing behind its use of AI to examine contracts, calling it “a commonsense precedent.” And documents obtained by ProPublica suggest the VA is looking at additional ways AI can be deployed. A March email from a top VA official to DOGE stated:

Today, VA receives over 2 million disability claims per year, and the average time for a decision is 130 days. We believe that key technical improvements (including AI and other automation), combined with Veteran-first process/culture changes pushed from our Secretary’s office could dramatically improve this. A small existing pilot in this space has resulted in 3% of recent claims being processed in less than 30 days. Our mission is to figure out how to grow from 3% to 30% and then upwards such that only the most complex claims take more than a few days.

If you have any information about the misuse or abuse of AI within government agencies, reach out to us via our Signal or SecureDrop channels.

If you’d like to talk to someone specific, Brandon Roberts is an investigative journalist on the news applications team and has a wealth of experience using and dissecting artificial intelligence. He can be reached on Signal @brandonrobertz.01 or by email brandon.roberts@propublica.org.


isn’t it crazy that a woman being gender nonconforming literally just requires her to exist in her…

hot-on-my-watch:

tannisroute:

isn’t it crazy that a woman being gender nonconforming literally just requires her to exist in her own body without making any changes whatsoever. why does the fact that i don’t wear makeup and i don’t shave and i don’t wear a bra have to be some political act. why can’t i just fucking exist

it is Exactly this kind of thinking that inspired this post lol

Alright Judith Butler, bit early in the day to be proving so conclusively that gender is at least in part a social construct isn’t it? 😅😅😅

And the autistic in me wants to make sure you know what “conformity” is.

But yes, strong agree.

Recently I realised that while me and various other women I know would be quietly delighted to be lasered up and so never again grow hair on our armpits, genitals or legs, my husband, who has never had a beard or moustache and does not intend to, would not make the same choice for his face. He was actually very surprised at me. Then I saw a post on Reddit where some man called his girlfriend a whore for having had the same body hair lasered off. Different worlds!

And I say this as someone who has had all their natural body hair for years now- disability baby!

To clarify and apologise because I misspoke, @tannisroute makes an extremely good point about the nature of gender conformity for pubescent girls and women, certainly in the West. Men express physical gender conformity by leaving their body much as it is, whereas women can only do it by actively altering ours in never-ending processes that consume much more time, energy and expense. As OP said:


Man trims only hair on head: conformity

Woman trims only hair on head: non-conformity.


Man wears no makeup: conformity

Woman wears no makeup: non-conformity


For men, gender conformity is more often a lack of action, where for us it is action itself.


Chasing the Electric Pangolin Open Thread

A few months ago, I remember reading some press about a new economics preprint out of MIT. The Wall Street Journal covered the research a few days after it dropped online, with the favorable headline, “Will AI Help or Hurt Workers? One 26-Year-Old Found an Unexpected Answer.” The photo for the article shows the promising young author, Aidan Toner-Rodgers, standing next to two titans of economics research, Daron Acemoglu (2024 Nobel laureate in economics) and David Autor.

“It’s fantastic,” said Acemoglu.

“I was floored,” said Autor.

The Atlantic and Nature covered the research as well, with both publications seemingly stunned by the quality of the work. And indeed, the quality of the work was stunningly high! The article analyzes data from a randomized trial of over one thousand materials researchers at the R&D lab of a US-based firm who were given access to AI tools. Toner-Rodgers adeptly tracks the effect of access to these AI tools on:

  • The number of materials discovered by the researchers.

  • The number of patents filed on those new materials.

  • The number of new product prototypes developed based on those new materials.

  • The time-allocation of the researchers over time, split between experimentation, judgment, and ideation.

  • The sentiment towards AI of the researchers, before and after AI tool adoption.

Not only do each of these metrics show really clear effects, but Toner-Rodgers throws every tool in the book at exploring them, using a number of really sophisticated methodologies that must have taken tremendous effort and care:

  • He calculates the quality of the new materials through a really elaborate algorithm that measures the distance from the “target” properties for each material discovered.

  • He measures the structural similarity of the crystal structures of the new materials to current materials by calculating the difference in atomic positions. This is really hard to do, even for materials scientists, let alone for economists!

  • He determines the novelty of patents using bigram analysis (a sketch of how such a measure can work follows this list).

  • He uses a large language model (Claude 3.5) for the automated classification of research tasks.
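
For readers who haven't run into it, bigram-based novelty scoring generally works something like the sketch below. This is my own illustration of the general technique, not Toner-Rodgers' code, and the example texts are made up.

```python
# My own illustration of bigram-based novelty scoring, not Toner-Rodgers' code.
# A patent's novelty is scored as the share of its word bigrams that never
# appear in a reference corpus of earlier patents.
from itertools import pairwise  # Python 3.10+

def bigrams(text: str) -> set[tuple[str, str]]:
    words = text.lower().split()
    return set(pairwise(words))

def novelty(patent_text: str, prior_patents: list[str]) -> float:
    prior: set[tuple[str, str]] = set()
    for p in prior_patents:
        prior |= bigrams(p)
    new = bigrams(patent_text)
    return len(new - prior) / len(new) if new else 0.0

# Toy example: two of the three bigrams are unseen in the prior corpus, so novelty is about 0.67.
print(novelty("novel solid-state electrolyte composition",
              ["lithium ion electrolyte composition"]))
```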

At the time I saw the press coverage, I didn’t bother to click on the actual preprint and read the work. The results seemed unsurprising: when researchers were given access to AI tools, they became more productive. That sounds reasonable and expected.

Toner-Rodgers submitted his paper to The Quarterly Journal of Economics, the top econ journal in the world. His website said that he had received a “revise and resubmit” already, meaning that the article was probably well on its way to being published.

Unfortunately for everyone involved, the work is entirely fraudulent. MIT put out a press release this morning stating that they had conducted an internal, confidential review and that they have “no confidence in the veracity of the research contained in the paper.” The WSJ has covered this development as well. The econ department at MIT sent out an internal email so direly worded that, at first glance, students reading it assumed someone had died.

In retrospect, there had been omens and portents. I wish I had read the article at the time of publication, because I suspect my BS detector would have risen to an 11 out of 10 if I’d given it a close read. It really is the perfect subject for this blog: a fraudulent preprint called “Artificial Intelligence, Scientific Discovery, and Product Innovation,” with a focus on materials science research.

Hindsight is of course 20/20, but the first red flag that should have been raised is the source of the data itself. The article gives enough details to raise some intense curiosity. It’s a US-based firm that has (at least) 1,018 researchers devoted to materials discovery alone, an enormous amount. This narrows it down to a handful of firms. Initially the companies Apple, Intel, and 3M came to mind, but then I noticed this breakdown of the materials specialization of the researchers in the study:

This was bizarre to me, as very few companies do massive amounts of materials research that is also split fairly evenly across the spectrum of materials, spanning disparate domains such as biomaterials and metal alloys. I did some “deep research” to confirm this hypothesis (thank you ChatGPT and Gemini) and I believe that there are a few companies that could plausibly meet this description: 3M, Dupont, Dow, and Corning. None of these are perfect fits, either, especially with the 32% share on metals and alloys.

I’ll really be embarrassing myself if it turns out that an actual R&D lab was supplying Toner-Rodgers with data and he was just fraudulently manipulating it, but I think this is quite unlikely, and it’s more plausible that the data was entirely fabricated to begin with. I have several reasons for believing this:

  • Why would a large company like this take such pains to run a randomized trial on its own employees, tracking a number of metrics of their performance, only to anonymously give this data to a single researcher from MIT—a first year PhD student, mind you—rather than publishing the findings themselves?

  • Even at those large R&D companies, only a small fraction of researchers are devoted to the task of “materials discovery,” and it seems implausible that a company would run an experiment on AI adoption on over a thousand employees in such a structured manner.

  • The description of the tasks these employees do, the divisions between fields, and all the other information provided seem almost too neat to be true. Real companies don’t have hundreds of R&D teams each working on similar tasks, all of a similar size, all tracking the same metrics. It reads like how an economics student at MIT imagines R&D labs to be run if their only experience with such labs is from reading the top 1% of economics papers on innovation in research.

The next red flag should have been how spotless the findings were. In every domain that was explored, there was a fairly unambiguous result. New materials? Up by 44% (p<0.000). New patents? Up by 39% (p<0.000). New prototypes? Up by 17% (p<0.001).

The quality of the new materials? Up, and statistically significant. The novelty of the new materials? Up, and statistically significant. Did researchers who were previously more talented improve more from AI tool use? Yes. Were these results reflected in researchers’ self-assessments of their time allocation? Unambiguously yes. The plot for that last bit is every economist’s dream, a perfect encapsulation of the principle of comparative advantage taking effect:

And look how contrived and neat this other plot looks, showing whether researchers’ self-assessment of their judgment ability correlates with their survey response on the role of different domains of knowledge in AI materials discovery. Three out of four categories show a neat increase and one out of four remains constant (the one that, from first principles, seems like it wouldn’t matter: experience using other AI-evaluation tools).

This plot also makes no sense, when you think about it. Why would researchers with better judgment be systematically more likely to give higher numbers on this survey question on average?

Q3: On a scale of 1–10, how useful are each of the following in evaluating AI-suggested candidate materials (scientific training, experience with similar materials, intuition or gut feeling, and experience with similar tools)?

And then, to cap it off, here’s how Toner-Rodgers describes a fortuitous round of layoffs at the firm that miraculously doesn’t interfere with the data collection for the primary analysis and yet contributes an insightful example that supports his findings:

“In the final month of my sample—excluded from the primary analysis—the firm restructured its research teams. The lab fired 3% of its researchers. At the same time, it more than offset these departures through increased hiring, expanding its workforce on net. While I do not observe the abilities of the new hires, those dismissed were significantly more likely to have weak judgment. Figure 13 shows the percent fired or reassigned by quartile of γ̂_j. Scientists in the top three quartiles faced less than a 2% chance of being let go, while those in the bottom quartile had nearly a 10% chance.”

I mean, come on, be for real…

Now, my background in materials science provides me a neat leg up, as I’d assume the vast majority of those reviewing/reading/following this paper are economists and people interested in the effects of AI use.

How do the parts of this paper that directly engage with materials science hold up? Well, they’re a little too clever. Take Toner-Rodgers’ analysis of “materials similarity” where he claimed to have used crystal structure calculations to determine how similar the new materials were to previously discovered materials. The plot is stunningly unambiguous, the new materials discovered with AI are more novel.

However, it boggles the mind that a random economics student at MIT would be able to easily (and without providing any further details) perform the highly sophisticated technique from the paper he cites (De et al., 2016), especially in this elegantly formalized manner, without any domain expertise in computational materials research. This graph, and the data it represents, if true, would probably be worth a Nature paper on AI materials discovery on its own. In his paper, it’s relegated to the appendices.

This methodology also makes no sense for generalizing across different types of materials, so I have no clue how you could reduce the results from such broad classes of materials to a single figure of merit in this manner. The gaps between 0.0 and 0.2 and between 0.8 and 1.0 might seem reasonable to someone who read a few papers and noticed similar gaps in a couple of the graphs, but they would be bizarre when generalized across several classes of materials, and the data is likely completely fabricated for this reason. To simplify this critique: a novel metal alloy would have a very different level of similarity to a reference class of previously discovered alloys than a novel polymer would to its own reference class. It would require some really sophisticated methodology to normalize this single figure of merit across material types, which Toner-Rodgers does not mention at all. Also, this would all be insanely challenging to implement using data from the Materials Project, requiring some sophisticated “big data” workflows.

If you want a smoking gun, here’s a graph from a paper Toner-Rodgers cites, Krieger et al., “Missing Novelty in Drug Development,” which uses a similar methodology for drug discovery. It looks eerily similar to the distribution in this preprint. That distribution might make sense for drugs, but it makes very little intuitive sense for a broad range of materials, with the figure of merit derived directly from the atomic positions in the crystal structure. This is the kind of mistake that someone with no domain expertise in materials science might make.

Toner-Rodgers’ treatment of “materials quality” would also probably drive a materials scientist insane if they were forced to think about it at length.

Here’s the equation he uses to calculate the “quality” of a new material:

This would likely be an extreme case of garbage in, garbage out. First of all, there are typically no “target features” that are easily reduced to single values, but even if there were, some of these would be distributed on a log scale, which would dramatically skew the values for certain classes of materials. Also, in general, the “quality” of a new material that an R&D lab develops is likely not related at all to improvements in the actual top-line figures of merit like “band gap” or “refractive index,” the two examples that Toner-Rodgers gives. Instead, what matters would be things like durability, affordability, ease of manufacture, etc. These are all properties that are not easily reduced to a single value. And even if they were, good luck getting researchers to measure, systematize, and document these values for the new materials!

However, from this amalgam of gibberish, Toner-Rodgers manages to extract a significant finding anyway! All 1,018 scientists contribute to this endeavor, and statistically significant findings are reported in every single category:

Some people might look at this saga and think “ah, another bs preprint, thankfully we have peer review to deal with it.” However, I think that were it not for the fact that this preprint had gained so much attention, this article would have slipped through peer review, only to embarrass the editors of the top econ journal in the world after being published and reported on.

Moreover, these are the kind of errors that the editorial process at an econ journal might not catch. I think the most clearly fraudulent components of the paper are those that seem to dramatically simplify the complexity of the materials work going into the paper. Robert Palgrave, who has in the past been an outstanding source of skeptical analysis of AI materials discovery claims, has a Twitter thread noting similar problems with the work (I promise I read his thread after writing the bulk of this blog post). And when the piece originally came out, he had an orthogonal but also very valid set of reasons for being skeptical of the work (mostly due to the difficulty of defining the “novelty” of materials).

In general, the lesson I think we should learn is to be much more skeptical of these sorts of research findings. Learning new things about the world is hard, and generally randomized trials on such a complex topic should show much more ambiguous results. The fact that the data was so beautiful and fit such a perfect narrative should have raised alarm bells, rather than catapulting the results to international attention.

I also think that if comments were enabled on arXiv preprints, this could have led to a much more rapid conclusion to the fraud. Probably a materials scientist who read the paper realized it was fraudulent but wasn’t able to get that view quickly to the economists who were actually reading and discussing the paper. A well-written arXiv comment explaining why the data on materials similarity, for example, couldn’t be true would have gone a long way.

After writing a draft of this blog post, I saw this tweet which says that Corning, this January, filed an IP complaint with the WIPO against Toner-Rodgers for registering a domain name called “corningresearch.com”.

This validates my earlier guess as to which companies’ data this might plausibly be. However, it looks like Toner-Rodgers may have been using this website to privately substantiate his fake data, without the knowledge of Corning? I’m not sure what this means, but it’s certainly interesting. It’s possible he was using the domain name to send fake emails to himself, or to generate pdf files at plausible-sounding urls, to show his advisor. Corning is a great company, and if they actually did collect this data and evaluate the materials properties in some coherent manner, that’s extremely impressive. However, I still think it’s far more likely that the data was completely fabricated by Toner-Rodgers.


Lessons From the Newark Debacle

I did something either brave or foolish last week. I was booked on a flight from Amsterdam to Newark, and decided to ignore advice from friends urging me to rebook on a plane heading someplace else. And a strange thing happened: my flight arrived right on schedule.

Obviously thousands of flyers have been having very different experiences in recent weeks, and air traffic control at Newark remains a mess. So what can we learn from the debacle?

I’d like to blame Elon Musk and say that all those delayed travelers have been DOGEd. Sadly, the problems at Newark, and with air traffic control in general, have been building for many months. So you can’t blame this problem on the Muskenjugend — the tech bros barely old enough to shave that DOGE has parachuted into many government agencies — even though they are indeed wreaking havoc and will be responsible for many future debacles.

That said, the Newark mess is an object lesson in what’s wrong with DOGE and right-wing views of government in general.

The proximate causes of the current crisis, as I understand them, go like this: The Federal Aviation Administration as a whole is severely understaffed, with a dangerous shortage of air traffic controllers in particular, and it relies on antiquated equipment — we’re talking Windows 95 and floppy disks. Recruiting controllers for the New York area has been especially hard because of the high cost of living (which is mainly about housing). In an effort to improve recruitment, the FAA moved control of Newark’s airspace to Philadelphia, where the cost of living is substantially lower.

But many controllers refused to make the move, and the technology side of the transition was botched — apparently the Philadelphia center’s jerry-rigged link to radar and communications keeps going down, and some of the controllers in Philadelphia have been so traumatized that they have exercised their right to take leaves of absence, worsening the staff crisis.

Ordinarily I’d say that we’ll eventually have the full story of what went wrong and find ways to fix it. But maybe not. Do you trust Trump administration officials to conduct a full and honest inquiry rather than look for ways to blame the Biden administration and/or the traffic controllers? Do you trust them to look for real solutions rather than justifications for privatization and sweetheart contracts for supporters?

I was struck by Sean Duffy, the transportation secretary, declaring that “patriotic controllers are going to stay on and continue to serve the country.” This from an administration that has taken self-dealing to levels unimagined in our nation’s history.

But back to DOGE and all that. The whole premise underlying Muskification is that much of the federal workforce is deadwood — legions of overpaid bureaucrats pushing paper around without doing anything useful. In reality, however, many federal workers are like air traffic controllers — doing jobs that are essential to keeping the economy and normal life in general proceeding smoothly. And while the air traffic controller shortage is probably (I hope!) exceptionally severe, the federal bureaucracy is in general stretched thin after decades of anti-government rhetoric that have left federal employment as a share of total employment far below historical levels:

And if you’re wondering why the government is having trouble recruiting enough traffic controllers, you should know that the Congressional Budget Office has found that highly educated federal workers are, on average, paid less than equivalent workers in the private sector. Workers with a doctorate or professional degree are paid 29 percent less than their private-sector counterparts:

Source: Congressional Budget Office

And this gap has widened in recent years, because Congress has capped federal salary increases.

This matches my personal observation. The federal workers I know tend to be in economics or finance-related jobs, and they earn less — sometimes far less — than they could make if they went to Wall Street.

Why, then, do highly educated Americans even take federal jobs? CBO stresses job security, which has indeed historically been higher for federal workers than their private-sector counterparts. I would also say, based on those I know, that meaning is a factor. At least some high-level federal workers accept lower pay than they could make elsewhere because they feel that they’re doing something that matters. No doubt that’s only a relatively small subset of the federal work force, but it’s surely an important subset, people who are doing especially crucial jobs.

But that was the way things used to be. How much job security can high-level federal workers feel when they never know when they’ll be DOGEd — abruptly fired without notice, locked out of their offices and even their email accounts? How much pride can they take in their work when their political masters never miss a chance to say that they’re worthless (unless there’s a crisis, in which case it becomes their patriotic duty to stay on the job)?

So my prediction is that the air traffic control crisis is the shape of things to come. In a matter of months Trump, Musk and company have severely degraded the morale and, eventually, the quality of the federal work force. And the result will be many more debacles.


Wrigley Field Bleacher Goose Makes A Nest, Forcing The Closure Of A Section Of Seats

This Canada goose decided to nest in Wrigley Field's bleachers, forcing a section under the scoreboard to be cordoned off.

WRIGLEY FIELD — There’s a new bleacher bum at Wrigley Field.

A Canada goose took up residence in a planter box in deep centerfield, forcing the Cubs to block off a section of bleacher seats Saturday to keep the peace.

The goose made a nest under the famed centerfield scoreboard’s right side. Its black and white head poked out over the evergreens in the concrete planter, surveying the 35,000 fans on hand.

Dozens of bleacher seats were cordoned off with stanchions, signs and ballpark operations staff assigned to keep guard.

“They’re calling us the geesekeepers,” one said.

The special section of empty seats could be seen from throughout the ballpark, a goose island of sorts that’s on brand with the Cubs’ beer offerings.

The goose apparently was in the nest for Friday’s home opener before the issue was identified. Its presence caused a stir, leading to the creation of the Geesekeeper Patrol and special section for Saturday’s game.

Fans took photos and had a laugh at the latest bleacher legend.

“The drunker they got, the funnier they were,” said one observer.

It’s not clear if the goose is protecting a nest of eggs. No one wanted to get close enough to investigate. This is nesting season for Canada geese, with females nesting on the eggs and males keeping guard nearby. Eggs typically hatch within a month. Friday was the first game of the season, so it’s not clear how long the goose has been there.

The Cubs won both Friday and Saturday, making the goose 2-0 and perhaps an essential mascot.

After Saturday’s game, the goose appeared pretty content, looking out over the ballpark as gulls swarmed the bleachers below, looking for edible bleacher debris.

Has a name been given?

“They were calling her Suzuki,” one guard said, a nod to Cubs slugger Seiya Suzuki.


