ML & AI news of the week

Photo by Priscilla Du Preez 🇨🇦 on Unsplash

A collection of the best ML & AI news every week (research, news, resources). Star this repository if you find it useful.

Here, you can find articles and tutorials about artificial intelligence

For each week you will find different sections:

Research: the most important published research of the week.
News: the most important news related to companies, institutions, and much more.
Resources: released resources for artificial intelligence and machine learning.
Perspectives: a collection of deep and informative articles about open questions in artificial intelligence.

and a meme for starting well the week.

Suggestions and corrections

Feel free to open an issue if you find some errors, if you have any suggestions, topics, or any other comments

Index

2024

2023

Back to index

2024

ML news: Week 14 - 20 October

Research

Link	description
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models.	Introduces a novel RAG method to address the challenges of imperfect retrieval augmentation and knowledge conflicts in LLMs. Astute RAG adaptively extracts critical information from the internal knowledge of LLMs, then iteratively merges this with external knowledge while maintaining source awareness. Its interactive consolidation mechanism enhances the integration of internal and external information by identifying consistent passages, detecting conflicting data, and filtering out irrelevant content.
ToolGen: Unified Tool Retrieval and Calling via Generation.	Incorporates tool knowledge directly into LLMs by encoding tools as unique tokens, allowing the model to generate tool calls and arguments, facilitating smooth tool invocation alongside natural language generation. Experiments involving over 47,000 tools demonstrate that ToolGen outperforms in both tool retrieval and autonomous task execution.
Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG.	Finds that in many long-context LLMs, output quality diminishes as the number of passages increases, with the performance decline attributed to retrieved hard negatives. The authors propose two methods to enhance long-context LLM-based RAG: retrieval reordering and RAG-specific tuning with intermediate reasoning to improve relevance identification. These approaches show marked improvements in both accuracy and robustness in long-context RAG performance.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.	Evaluates several state-of-the-art (SoTA) models using a benchmark built with symbolic templates that allow for a range of mathematical problems. The results show that LLMs display variability when answering different versions of the same questions, and their performance drops when numerical values in the questions are adjusted. As the complexity of the questions increases (e.g., adding more clauses), performance deteriorates significantly. The authors suggest that this decline in performance is likely due to a lack of logical reasoning capabilities in current LLMs.
Addition is All You Need for Energy-efficient Language Models.	Introduces an algorithm that approximates floating-point multiplication using integer addition operations, making it computationally less intensive than 8-bit floating-point arithmetic while achieving higher precision. The authors report that implementing the proposed L-Mul operation in tensor processing hardware could potentially reduce energy consumption by 95% for elementwise floating-point tensor multiplications and by 80% for dot product operations.
I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy.	Examines the interaction patterns of LLMs within a multi-agent setting involving a social hierarchy, specifically in a scenario where a guard and a prisoner interact, with the prisoner either seeking extra yard time or attempting to escape. The study finds that when power dynamics are present, LLMs struggle to maintain coherent conversations. Additionally, the authors highlight that agents' personas significantly influence their behaviors. Interestingly, even without explicit prompting, merely assigning roles to agents resulted in the emergence of anti-social behaviors.
Were RNNs All We Needed?	The paper revisits RNNs and demonstrates that removing the hidden states from the input, forget, and update gates allows for efficient parallel training. This adjustment eliminates the need for architectures like LSTMs and GRUs to rely on backpropagation through time (BPTT). They introduce new variants, called minLSTMs and minGRUs, which are 175 times faster for sequences of length 512.
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations.	The study finds that "truthfulness" information in LLMs is concentrated in specific tokens, offering a way to improve error detection and address related challenges. They also suggest that the internal representations of LLMs can be used to predict the types of errors these models are prone to making.
Archon: An Architecture Search Framework for Inference-Time Techniques.	The paper presents a modular framework for constructing and optimizing LLMs by integrating various inference-time techniques. This approach redefines the task of LLM system design as a hyperparameter optimization problem. Tested on benchmarks like MT-Bench and CodeContests, the framework, named Archon, outperforms top models such as GPT-4o and Claude 3.5 Sonnet, achieving a 15.1% average accuracy improvement.
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning.	RATIONALYST is a model designed for process-supervision of reasoning, enabling it to generalize across a wide range of reasoning tasks. This is accomplished by pre-training on a dataset of 79k rationales from the Pile and a variety of reasoning datasets, with minimal human involvement. Fine-tuned from LLaMa-3-8B, the model achieves a 3.9% average accuracy improvement across seven reasoning benchmarks.
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation.	The paper introduces a unified framework to evaluate an LLM’s capability to provide factual responses, assess retrieval skills, and reason through the generation of final answers. The framework includes multi-hop questions that require combining information from multiple sources. It reports that state-of-the-art LLMs struggle with this task, achieving only 40% accuracy without retrieval. However, the proposed multi-step retrieval method improves performance to 66% accuracy.
Not All LLM Reasoners Are Created Equal.	The paper introduces a unified framework to evaluate an LLM’s capability to provide factual responses, assess retrieval skills, and reason through the generation of final answers. The framework includes multi-hop questions that require combining information from multiple sources. It reports that state-of-the-art LLMs struggle with this task, achieving only 40% accuracy without retrieval. However, the proposed multi-step retrieval method improves performance to 66% accuracy.
Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis.	Training generative models like GANs with limited data is challenging. Existing Implicit Maximum Likelihood Estimation (IMLE) methods suffer from poor alignment between the latent codes used during training and those used during inference. The proposed approach, RS-IMLE, modifies the prior distribution during training, resulting in better test-time performance and higher-quality image generation.
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models.	This study introduces a unified framework aimed at enhancing training stability in continuous-time consistency models, leading to substantial improvements in the performance of generative models.
DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection.	DARNet is an innovative model for auditory attention detection (AAD) that improves the decoding of brain signals, such as EEG, by integrating spatiotemporal and dual attention mechanisms.
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads.	DuoAttention is a framework designed to optimize memory usage and reduce latency in long-context large language models (LLMs) by selectively applying full key-value (KV) caching to only the most essential attention heads.
Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement.	Meta Decision Transformer (Meta-DT) aims to enhance generalization in reinforcement learning by integrating transformer-based sequential modeling with effective task representation learning.

News

Link	description
AI gives voice to dead animals in Cambridge exhibition.	Creatures can converse and share their stories by voice or text through visitors’ mobile phones at Museum of Zoology
Three-armed robot conductor makes debut in Dresden.	German city’s Sinfoniker says the aim is not to replace humans but to play music human conductors would find impossible
Tesla’s value drops $60bn after investors fail to hail self-driving ‘Cybercab’.	Analysts criticize lack of detail about the ‘robotaxi’ showcased by CEO Elon Musk
Microsoft may have an audio-to-image generator in the works, new patent shows.	Microsoft has submitted a patent for an AI system that transforms live audio into images using large language models (LLMs). The system is intended to improve communication by creating real-time visuals from audio streams. Once developed, it could potentially be incorporated into Microsoft Teams through Copilot integration.
Australia’s spy chief warns AI will accelerate online radicalization.	Asio boss Mike Burgess says social media impact is a ‘step-change’ in the threat posed by extremism
Google to buy nuclear power for AI datacentres in ‘world first’ deal.	Tech company orders six or seven small nuclear reactors from California’s Kairos Power
Silicon Valley is debating if AI weapons should be allowed to decide to kill.	In late September, Shield AI co-founder Brandon Tseng swore that weapons in the U.S. would never be fully autonomous — meaning an AI algorithm would make the final decision to kill someone. “Congress doesn’t want that,” the defense tech founder told TechCrunch. “No one wants that.”
Zoom’s custom AI avatar tool may come with risks.	The upcoming feature, announced today at Zoom’s annual dev conference, will translate a video clip that users record of themselves into a digital clone — complete with a head, upper arms, and shoulders. Users will be able to type a script of what they want the digital double to say, and Zoom will generate audio that syncs with the avatar’s lip movements.
Generate Video (beta) on Firefly Web App.	During the Adobe MAX conference, Adobe revealed the extension of its Firefly series of creative generative AI models to include video.
OpenAI appoints international expansion boss.	OpenAI has named Oliver Jay as the head of its international expansion, with a focus on AI strategy and operations. The company also revealed the opening of a new APAC office in Singapore and is working on developing datasets for local languages. The o1 model, which incorporates "chain of thought" methods, is designed to improve AI accuracy.
Anthropic challenges OpenAI with affordable batch processing.	Anthropic has introduced a Message Batches API, enabling businesses to handle large data volumes at half the cost of traditional API calls. The API allows for up to 10,000 asynchronous queries within 24 hours, providing a cost-efficient solution by shifting AI processing from real-time to "right-time." This approach encourages AI adoption among mid-sized companies but may draw attention away from the advancement of real-time AI capabilities.
OpenAI Projections Imply Losses Tripling To $14 Billion In 2026.	OpenAI projects losses to rise to $14 billion in 2026, with total losses reaching $44 billion by 2028.
AMD launches AI chip to rival Nvidia's Blackwell.	AMD has introduced the Instinct MI325X AI chip, targeting competition with Nvidia's leading data center GPUs.
Meta’s open AI hardware vision.	Meta unveiled its open AI hardware designs, including the Catalina rack and the enhanced Grand Teton platform, at the OCP Global Summit. Notably, training the Llama 3.1 405B model required 16,000 NVIDIA H100 GPUs, demonstrating Meta's robust scaling infrastructure. These open AI hardware systems are essential for driving further advancements in AI capabilities.
The New York Times warns AI search engine Perplexity to stop using its content.	The New York Times has sent a cease and desist letter to AI startup Perplexity, accusing the company of using its content without authorization for AI search purposes. Perplexity asserts that it does not scrape content for training but instead indexes web pages to provide factual information. The company is currently in discussions with publishers and seeks to resolve the matter by collaborating with the Times and other media organizations.
Decagon raises $65m Series B led by Bain Capital Ventures to bring total funding to $100m.	Decagon has secured $65 million in Series B funding to further develop its AI customer support agents, which are already utilized by companies such as Duolingo and Eventbrite to streamline customer interactions. These AI agents automate routine tasks, allowing customer support teams to focus on more strategic roles. The funding will be used to strengthen Decagon's engineering team and extend its AI solutions into new markets and industry sectors.
New high-quality AI video generator Pyramid Flow launches — and it’s fully open source!	The number of AI video generation models continues to grow with a new one, Pyramid Flow, launching this week and offering high-quality video clips up to 10 seconds in length — quickly, and all open source.
This three-person robotics startup is working with designer Yves Béhar to bring humanoids home.	Kind Humanoid's three-person team is developing a whimsical humanoid robot named Mona, specifically designed for home use rather than industrial applications. The team aims to conduct field tests with a dozen initial prototypes next year. Unlike many AI-driven robotics companies that focus on industrial markets and heavy fundraising, Kind prioritizes innovation and efficiency, setting its approach apart from competitors in the robotics space.
INTELLECT–1: Launching the First Decentralized Training of a 10B Parameter Model.	INTELLECT-1 is the first decentralized model with 10 billion parameters, designed to harness global contributions for open-source AGI development. It utilizes OpenDiLoCo scaling to train large models across distributed devices, with innovations in bandwidth efficiency and fault tolerance. The new Prime framework further enhances decentralized training by optimizing compute utilization, achieving a 98% utilization rate during INTELLECT-1's 10-billion-parameter training run. This marks a significant advancement in decentralized AI model training.
Elon Musk Shows Off Tesla ‘Robotaxi’ That Drives Itself.	“You could fall asleep and wake up at your destination,” said Mr. Musk, Tesla’s C.E.O., but some experts are skeptical that such cars will be ferrying passengers soon.
ByteDance lays off hundreds of TikTok employees in the shift to AI content moderation.	ByteDance’s TikTok is laying off hundreds of employees, mainly in Malaysia, according to Reuters. The cuts come as the social network is increasingly turning to AI for content moderation. The cuts do not impact employees in the U.S.
Microsoft Artificial Intelligence VP Bubeck to Join OpenAI.	Microsoft Corp. said one of its artificial intelligence vice presidents, Sebastien Bubeck, is leaving to join OpenAI, where Microsoft is both the largest investor and a rival.
‘It’s not me, it’s just my face’: the models who found their likenesses had been used in AI propaganda.	London-based Synthesia’s technology was employed to make deepfake videos for authoritarian regimes
Amazon.com joins push for nuclear power to meet data center demand.	Company says it signed three agreements on developing small modular reactor nuclear power technology
Un Ministral, des Ministraux.	On the first anniversary of Mistral 7B, Mistral launched two advanced models designed for on-device and edge computing: Ministral 3B and Ministral 8B. These models are optimized for tasks under 10 billion parameters, offering superior knowledge, reasoning, and efficiency. They also support a context length of up to 128k and deliver faster inference.
Former Palantir CISO Dane Stuckey joins OpenAI to lead security.	Dane Stuckey, the former CISO of analytics firm Palantir, has joined OpenAI as its newest CISO, serving alongside OpenAI head of security Matt Knight.
Can AI really compete with human data scientists? OpenAI’s new benchmark puts it to the test.	OpenAI has introduced a new tool to measure artificial intelligence capabilities in machine learning engineering. The benchmark, called MLE-bench, challenges AI systems with 75 real-world data science competitions from Kaggle, a popular platform for machine learning contests.
Adobe’s AI video model is here, and it’s already inside Premiere Pro.	New beta tools allow users to generate videos from images and prompts and extend existing clips in Premiere Pro.
Customize Audio Overviews with Google's NotebookLM.	NotebookLM now enables users to customize their Audio Overview experience, providing greater control over the areas of focus and expertise of the AI hosts. Companies can apply for the new NotebookLM Business pilot program, which includes improved tools designed for professional applications.
Combining next-token prediction and video diffusion in computer vision and robotics.	A new method can train a neural network to sort corrupted data while anticipating next steps. It can make flexible plans for robots, generate high-quality video, and help AI agents navigate digital environments.
Nvidia just dropped a new AI model that crushes OpenAI’s GPT-4—no big launch, just big results.	Nvidia quietly unveiled a new artificial intelligence model on Tuesday that outperforms offerings from industry leaders OpenAI and Anthropic, marking a significant shift in the company’s AI strategy and potentially reshaping the competitive landscape of the field.
Invisible text that AI chatbots understand and humans can’t? Yep, it’s a thing.	A quirk in the Unicode standard harbors an ideal steganographic code channel.
Google supercharges Shopping tab with AI and personalized recommendation feed.	After bringing generative AI to Search in 2023, Google is supercharging its Shopping tab with the technology. The company announced on Tuesday that it will use AI to help users shop for products based on exactly what they’re looking for. It also launched a new scrollable feed of personalized, shoppable products.
Adobe’s Project Super Sonic uses AI to generate sound effects for your videos.	Adobe's Project Super Sonic leverages text-to-audio technology, object recognition, and voice input to create audio effects for video projects.
White House considers expanding Nvidia’s and AMD’s AI chip export limits to additional countries.	The Biden administration is contemplating limitations on AI chip sales from Nvidia and AMD to countries in the Persian Gulf, citing national security concerns.

Resources

Link	description
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering.	It introduces a new benchmark to assess machine learning agents' proficiency in machine learning engineering tasks. The benchmark consists of 75 Kaggle competitions focused on key MLE skills, including model training, dataset preparation, and experiment execution. OpenAI's o1-preview model, utilizing the AIDE scaffolding, reaches a bronze medal level in 16.9% of the competitions.
Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System.	Presents a novel framework aimed at improving both communication efficiency and task effectiveness in LLM-based multi-agent systems through targeted LLM training. It introduces an iterative "generate, rank, select, and train" approach, enhanced by a reward function to optimize performance, token usage, and communication efficiency. The framework integrates Monte Carlo Tree Search-inspired techniques for DPO data generation, promoting diverse exploration. Experimental results show consistent improvements over single-agent baselines and standard multi-agent systems (MAS) using Llama 3 8B, achieving a 2.8x performance boost while utilizing fewer than 10% of tokens on tasks involving extensive information exchange.
Zyphra's Mamba 2 based model beats Mistral.	Introduces the first state space-style model that surpasses transformers at the 7B scale. It excels in understanding and generating long-context data, thanks to the linear time scaling of the Mamba 2 blocks, which significantly enhances its efficiency and performance.
OpenAI's Swarm.	OpenAI has introduced a lightweight framework designed to facilitate communication between agents. While it will not receive further updates, the framework could still offer valuable ideas and inspiration for future developments.
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.	EvolveDirector aims to develop a competitive text-to-image generation model using open, publicly available resources, avoiding the limitations imposed by proprietary models.
Rethinking the Evaluation of Visible and Infrared Image Fusion.	Researchers propose the Segmentation-oriented Evaluation Approach (SEA) to improve the evaluation of Visible and Infrared Image Fusion (VIF) techniques, which play a critical role in applications such as object detection and semantic segmentation.
A Gentle Introduction and Tutorial on Deep Generative Models in Transportation Research.	A gentle introduction and tutorial on deep generative models in transportation research provides a comprehensive overview of how these models can be applied to solve transportation problems.
Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis.	Trans4D is a new framework developed to address the challenges of realistic 4D scene transitions, enhancing text-to-4D synthesis. It offers improved capabilities in generating coherent, dynamic 4D scenes from textual descriptions, making it more suitable for tasks that require accurate spatial and temporal scene transitions.
DocMTAgent.	DelTA, short for Document-levEL Translation Agent, is an online translation tool designed for handling document-level translations. It leverages a multi-level memory architecture to improve translation accuracy and coherence across larger texts, providing more context-aware translations compared to sentence-level models.
Fast Feedforward 3D Gaussian Splatting Compression.	Fast Compression of 3D Gaussian Splatting (FCGS) is a new model designed to eliminate the need for the slow, per-scene optimization required by earlier methods. Instead, FCGS achieves rapid compression using a quick feed-forward pass, reducing the processing time from minutes to just seconds. This significantly accelerates the compression process while maintaining high-quality results for 3D data.
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling.	OneRef presents an optimized framework for referring segmentation by integrating visual and language feature spaces within a unified transformer architecture.
SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction.	SmartPretrain offers a versatile, model-agnostic, and dataset-agnostic self-supervised learning framework designed to enhance motion prediction in autonomous vehicles.
UvA - An Introduction to Group Equivariant Deep Learning.	Resources for studying deep learning techniques applied to specific types of geometric data while addressing architectural limitations.
Diffusion model simulating CS:GO.	An open-source replication of a diffusion model that generates visual simulations of a video game, using keyboard and mouse inputs to influence the output.
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs.	This study addresses the shortcomings of current alignment algorithms in large language models (LLMs), which tend to overfit to relative preferences and neglect response quality. The authors introduce reward-conditioned LLM policies and a novel data relabeling method that incorporates response quality, enabling the model to better generalize to optimal responses.
entropix.	Entropix is a tool designed to modify the sampling behavior of language models.
LoLCATs Blog Part 2: How to Linearize LLMs for Me and You.	Hazy Research has published another insightful post that delves into techniques for linearizing existing language models while maintaining much of their performance. This exploration highlights methods to simplify model architectures, making them more efficient, without significantly compromising their effectiveness in tasks like text generation and understanding.
TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control.	TextCtrl is a newly introduced diffusion-based method designed to enhance scene text editing. It achieves a balance between maintaining content accuracy and preserving the original style, ensuring that both the textual content and the visual appearance remain consistent during edits.
Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies.	iDP3 is an advanced 3D visuomotor policy designed to enable humanoid robots to autonomously navigate and perform tasks in a variety of real-world environments. This improved policy enhances the robot's ability to perceive and interact with its surroundings, making it more adaptable and efficient in complex and dynamic settings.
tabled.	Tabled is a small library for detecting and extracting tables. It uses Surya to find all the tables in a PDF, identifies the rows/columns, and formats cells into markdown, csv, or html.
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer.	HART is a cutting-edge visual generation model designed to produce high-quality 1024x1024 images, presenting a challenge to the capabilities of diffusion models. It enhances image reconstruction and reduces training costs by employing a hybrid tokenizer that integrates both discrete and continuous tokens, resulting in more efficient and effective image generation.
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention.	The Deformable Bi-level Routing Attention (DBRA) module is an innovation designed to enhance attention mechanisms in vision transformers. DeBiFormer, which is built upon DBRA, optimizes the selection of key-value pairs in the attention process, resulting in more efficient computations and better interpretability of queries within attention maps. This leads to improved performance and understanding of how the model attends to different parts of an image.
Six tips for going public with your lab’s software.	It’s not enough to write high-quality programs. If you want to make your apps public — and usable — you should also follow these steps.
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos.	CoTracker is a newly developed tracking model that bridges the performance gap between synthetic and real video data by employing semi-supervised training techniques.
A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration.	Researchers have developed a novel consistency-aware spot-guided Transformer designed to improve the efficiency and accuracy of point cloud registration.
Ditto - the simplest self-building coding agent.	Ditto is a user-friendly tool that allows you to generate a multi-file Flask application from simple natural language descriptions using a no-code interface. By leveraging a simple LLM loop with a few tools, Ditto automates the coding process, (occasionally) turning your ideas into functional web applications (or at least trying and getting close).
F5 Text-to-Speech System.	F5-TTS is a non-autoregressive, zero-shot text-to-speech system featuring a flow-matching mel spectrogram generator and a diffusion transformer. Developed on the MLX framework, F5 outperforms earlier systems such as E2 TTS by incorporating ConvNeXT v2 blocks for improved text alignment, enabling high-quality speech generation in approximately 11 seconds on modern hardware.
Movie Gen Bench.	"Movie Gen Bench" is an evaluation benchmark designed to assess performance in both video (Video Bench) and audio (Audio Bench). It includes 1,003 prompts that encompass a variety of testing aspects and concepts.
LongAlign.	LongAlign enhances the capability of text-to-image (T2I) diffusion models to process lengthy text inputs by incorporating segment-level encoding and a decomposed preference optimization approach.
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective.	DiGIT is an auto-regressive generative model that forecasts tokens in a latent space through self-supervised learning. This discrete tokenizer enhances image generation on ImageNet by clustering hidden states derived from DINOv2.
FL-Launching (Fling).	The FedPart method tackles the layer mismatch problem in federated learning by limiting model updates to designated layers in each training round.
Distributed Training Guide.	This is an in-depth guide on best practices for distributed training, troubleshooting errors, and maximizing the use of available resources.

Perspectives

Link	description
Nobel winner Geoffrey Hinton is the ‘godfather of AI’. Here’s an offer he shouldn’t refuse…	The computer scientist’s dogged belief in the potential of neural networks helped unlock machine learning. But he’d be wise to remember the experience of a fellow laureate
Machines of Loving Grace.	Dario Amodei, CEO of Anthropic, often writes internal memos, and one of them was published externally. In this memo, he explores the potential extremely positive impact of successfully building powerful AI systems. He envisions how AI could radically transform the world for the better, improving areas like science, economics, and societal well-being, while acknowledging the immense responsibility of ensuring AI development is aligned with human interests and safety.
This AI-Powered Invention Machine Automates Eureka Moments.	Iprova's AI-driven software analyzes diverse technical literature to generate patentable inventions by linking previously unrelated ideas. It uses semantic search and generative AI to identify novel inventions for companies like Procter & Gamble and Panasonic. Although AI plays a key role, human insight remains essential for applying the inventions practically, especially in fast-evolving industries. Iprova highlights the importance of human creativity in refining and validating invention ideas, ensuring that AI serves as a tool to enhance rather than replace human innovation.
Burn the Playbooks.	AI excels at tasks that follow structured rulesets, such as automating tax processes or solving math problems, where it can often outperform humans. However, relying too much on playbook-driven approaches in our work risks stifling human creativity, a key trait that differentiates us from machines. Overemphasizing formulaic tasks could make us more dependent on AI's strengths, limiting our own unique creative potential and inadvertently making us more "machine-like" in areas where creativity and flexibility are crucial.
Hurricane Helene and the ‘Fuck It’ Era of AI-Generated Slop.	An AI-generated image depicting Hurricane Helene has gone viral, despite viewers being fully aware that it isn't real. The image has sparked widespread attention and discussion, highlighting the power of AI-generated content to captivate audiences even when the authenticity is known. This trend reflects the growing influence of AI in shaping public perception and the viral nature of digital content.
OpenAI pursues public benefit structure to fend off hostile takeovers.	OpenAI is planning to restructure as a public benefit corporation (PBC) to safeguard against hostile takeovers and ensure its mission of benefiting humanity remains intact. This change will help OpenAI maintain its commitment to ethical AI development, prioritizing public good over profit while allowing the organization to continue innovating in a sustainable and mission-driven way.
Al Will Take Over Human Systems From Within.	In this post, Yuval Noah Harari, the Israeli historian and author of “Sapiens,” “Homo Deus,” and “Nexus,” explores the impact of information networks and AI on societal narratives, which can either unite or fragment communities. He cautions that AI, functioning as an "alien intelligence," could centralize power due to its lack of self-correcting mechanisms, potentially threatening democratic systems. Harari stresses the importance of strong institutions to uphold truth in a world increasingly influenced by AI-driven decision-making across different sectors.
Sticky humans in a post-AGI world.	AI tutors encounter considerable difficulties in replicating the social and intellectual interactions offered by human teachers. Although AI has made progress, it still falls short in handling complex educational tasks and cannot deliver the nuanced socio-intellectual experiences that human educators provide. A hybrid approach, where AI complements rather than replaces human teachers, may be more effective, given the essential social and cultural elements of the learning process.
AI has dreamt up a blizzard of new proteins. Do any of them actually work?	Emerging protein-design competitions aim to sift out the functional from the fantastical. But researchers hope that the real prize will be a revolution in the field.
Considerations for governing open foundation models.	Foundation models drive AI innovation, but debates on their release—whether open or closed—raise concerns about potential risks and the impact of regulations on innovation.
I AI-generated some podcasts – and the results are uncanny.	Google’s new tool NotebookLM lets you create podcasts at the click of the button. They’re way more realistic than you’d think …
SB 1047: Our Side Of The Story.	California's proposed SB 1047, which sought to require AI companies to address existential risks posed by their technologies, was vetoed by Governor Newsom. He argued that the bill did not adequately regulate smaller, potentially dangerous AI models. Despite strong support from AI safety advocates like Dan Hendrycks and high-profile figures such as Elon Musk, the bill faced opposition from major AI companies, including OpenAI and Google. Newsom's veto has sparked discussions within the AI community about future regulatory strategies and potential collaborations with broader political groups to create comprehensive AI safety measures.
Overview of strong human intelligence amplification methods.	Advancements in AI depend on developing humans with enhanced cognitive abilities to effectively manage the complexities of AGI development. Approaches such as brain emulation, genomic modifications, adult brain gene editing, and brain-brain interfaces are being explored, each presenting distinct challenges and risks. These efforts are aimed at solving deep philosophical issues, significantly amplifying human intelligence, and addressing the potential threats posed by AGI.
LLMs don’t do formal reasoning - and that is a HUGE problem.	A study conducted by Apple raises questions about the effectiveness of large language models (LLMs), revealing that they primarily depend on pattern matching instead of formal reasoning. This reliance results in fragile and inconsistent outcomes, challenging the robustness of LLMs in tasks requiring deeper cognitive processes.
Why ChatGPT maker OpenAI is at fight with Open AI.	OpenAI is currently engaged in a legal dispute with Guy Ravine's company, Open AI, over the rights to the "Open AI" name and the original open-source AI vision. The conflict centers on ownership of the name and the direction of the open-source principles that initially defined the AI development approach.
AI mediation tool may help reduce culture war rifts, say researchers.	System built by Google DeepMind team takes individual views and generates a set of group statements
Here’s the deal: AI giants get to grab all your data unless you say they can’t. Fancy that? No, neither do I.	Data is vital to AI systems, so firms want the right to take it and ministers may let them. We must wake up to the danger
Where’s The Generative AI ROI? Start With The Supply Chain.	Generative AI is revolutionizing supply chain operations by effectively managing unstructured documents, resulting in substantial time and cost savings. Flexport, a technology company focused on supply chain solutions, has effectively implemented AI to automate and optimize document management, cutting processing time by 80%. This use of AI highlights its practical value in revenue-generating activities rather than merely in theoretical advancements.

Back to index

ML news: Week 7 - 13 October

Research

Link	description
A multimodal generative AI copilot for human pathology.	PathChat is a vision-language AI assistant designed for pathology, combining a foundational vision encoder and a large language model, achieving state-of-the-art performance on diagnostic tasks and outperforming other multimodal AI systems, with potential applications in education, research, and clinical decision-making.
Meta Movie Gen.	Meta has developed a cutting-edge movie model with 30 billion parameters, which required 6,144 H100 GPUs for training. The model was trained using 1 billion images and 100 million carefully selected videos. Notably, it is based on a Temporal Autoencoder and incorporates Flow matching Llama. Meta also published a highly detailed 92-page research paper, making it one of the most comprehensive reports on the subject.
When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1.	Large language models face limitations because they rely on next token prediction. Although OpenAI's o1 model was trained with a new objective focused on reasoning traces, it still exhibits some of the same constraints associated with next token prediction.
Contextual Document Embeddings.	This paper presents a method similar to a neutral TF/IDF, as it gathers information from the entire corpus rather than relying on individual document embeddings. It effectively captures contextual information from surrounding documents and has achieved state-of-the-art results on the MTEB benchmark.
PairDistill: Pairwise Relevance Distillation for Dense Retrieval.	This project introduces a novel technique called Pairwise Relevance Distillation (PairDistill), aimed at enhancing the accuracy of dense retrieval methods.
Modeling relationships to solve complex problems efficiently.	Associate Professor Julian Shun develops high-performance algorithms and frameworks for large-scale graph processing.
Factual Accuracy in AI.	Integrative Decoding is a technique designed to improve the factual accuracy of large language models, particularly for open-ended tasks. This method helps ensure more reliable and accurate outputs by refining the model's ability to integrate information during generation.
Dynamic Diffusion Transformer.	The Dynamic Diffusion Transformer (DyDiT) improves the efficiency of diffusion models in image generation by building on the Diffusion Transformer (DiT). It achieves this by dynamically adjusting computational resources across different timesteps and spatial regions, minimizing redundancy and optimizing performance.
Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach.	The Frame-Aware Video Diffusion Model (FVDM) enhances video generation by overcoming the limitations of existing models. Instead of using a single timestep for the entire video clip, FVDM introduces a vectorized timestep variable, enabling each frame to follow its own noise schedule. This approach improves the quality and coherence of generated videos.
What Matters for Model Merging at Scale?	Model merging is a technique that allows the combination of two models to achieve the performance benefits of both. However, it does not always scale effectively with larger model sizes. This paper investigates the requirements and challenges for making model merging work efficiently with very large models, addressing issues related to scalability, performance trade-offs, and optimal merging strategies.
nGPT: Normalized Transformer with Representation Learning on the Hypersphere.	A significant amount of research effort is focused on normalizing the internal representations of language models. This study demonstrates that by placing every internal vector on a hypersphere, convergence time is significantly reduced for models of reasonable size, leading to more efficient training.
Genomic Foundation Model Benchmarking.	GFMBench is a newly developed framework aimed at tackling challenges in the development of genomic foundation models (GFMs) by offering standardized benchmarking tools. It supports the evaluation of GFMs with millions of genomic sequences and hundreds of tasks, automating the benchmarking process for open-source GFMs to streamline their development and comparison.
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations.	This study provides further evidence that language models internally encode signals when they produce non-factual information. Understanding these internal cues can help guide models more effectively and reduce the occurrence of hallucinations, offering a potential strategy for improving their reliability.
Differential Transformer.	Transformers often over-allocate attention to irrelevant context, leading to inefficiencies. This research presents the Diff Transformer, which enhances attention to relevant information while filtering out noise. It introduces a differential attention mechanism that computes attention scores by subtracting two separate softmax attention maps. This subtraction effectively cancels out noise and encourages sparse, more focused attention patterns, improving the model's performance on tasks requiring precise context understanding.

News

Link	description
Brave New World: Leo AI and Ollama Bring RTX-Accelerated Local LLMs to Brave Browser Users.	Nvidia's RTX-Acceleration combined with Ollama allows for running local models in the browser.
Liquid Foundation Models.	Liquid AI has introduced its first generation of Liquid Foundation Models (LFMs), offering state-of-the-art performance while minimizing memory consumption. The LFMs, which are optimized for different hardware platforms, include 1B, 3B, and 40B parameter models. These models are already accessible on platforms like LIQUID PLAYGROUND and will soon be available on Cerebras. They are particularly adept at processing sequential data and provide innovations in efficiency and scalability across industries like financial services and biotechnology.
Introducing Copilot Labs and Copilot Vision.	Microsoft is launching Copilot Labs to test advanced AI tools, including Think Deeper and Copilot Vision. These tools aim to expand the capabilities of their AI systems, offering enhanced functionality and deeper insights.
OpenAI’s DevDay brings Realtime API and other treats for AI app developers.	It’s been a tumultuous week for OpenAI, full of executive departures and major fundraising developments, but the startup is back at it, trying to convince developers to build tools with its AI models at its 2024 DevDay. The company announced several new tools Tuesday, including a public beta of its “Realtime API”, for building apps with low-latency, AI-generated voice responses. It’s not quite ChatGPT’s Advanced Voice Mode, but it’s close.
Microsoft brings AI-powered overviews to Bing.	Microsoft has introduced Bing generative search, an AI-driven feature that gathers and summarizes information from the web, offering users more concise and aggregated search results.
KoBold Metals, which uses AI to help find critical minerals for the energy transition, raises $491M.	Earlier this year, KoBold Metals found what might be one of the largest high-grade copper deposits of all time, with the potential to produce hundreds of thousands of metric tons per year, the company’s CEO said.
OpenAI gets $4 billion revolving credit line, giving it more than $10 billion in liquidity.	OpenAI has secured over $10 billion in liquidity, achieving a valuation of $157 billion following its latest funding round. The company raised $6.6 billion from key investors, including Microsoft and Nvidia, but is contending with substantial operational costs, particularly the need for additional GPUs to support large language model (LLM) training. OpenAI is currently exploring restructuring strategies to enhance financial growth and sustainability within the AI industry.
Black Forest Labs, the startup behind Grok’s image generator, releases an API.	Black Forest Labs, the Andreessen Horowitz-backed startup behind the image generation component of xAI’s Grok assistant, has launched an API in beta — and released a new model.
DataPelago raises $47M to optimize hardware for analytical workloads.	LLMs depend on vast amounts of unstructured data for training, but this data requires extensive cleaning and processing before it becomes useful. Traditional data processing systems, which are based on CPUs and current software architectures, were not designed to handle the scale and complexity of such data, resulting in slow and costly data preparation that hinders AI development. To address these challenges, DataPelago has introduced a Universal Data Processing Engine, designed to overcome performance, cost, and scalability limitations, making AI development faster and more affordable.
Google brings ads to AI Overviews as it expands AI’s role in search.	Google will begin to show ads in AI Overviews, the AI-generated summaries it supplies for certain Google Search queries, and will add links to relevant web pages for some of those summaries as well. It’s also rolling out AI-organized search results pages in the U.S. this week.
Nobel Physics Prize Awarded for Pioneering A.I. Research by 2 Scientists.	Two scientists who contributed to the development of neural networks have been awarded the Nobel Prize in Physics, recognizing their groundbreaking work in advancing artificial intelligence and neural network technologies.
Introducing the Message Batches API.	Anthropic has introduced a new batch processing API that allows developers to submit batches of up to 10,000 queries at once. Each batch is processed within 24 hours and is 50% cheaper than standard API calls, making it a more efficient and cost-effective solution for handling non-time-sensitive tasks.
Update on Reflection-70B.	A detailed post-mortem analysis of the highly anticipated Reflection-70B model revealed issues with its benchmark code, which inflated its performance claims. Although the team has since corrected these bugs, and the model's performance remains impressive, it does not quite reach the originally advertised levels.
Four-legged robot learns to climb ladders.	The proliferation of robots like Boston Dynamics’ Spot has showcased the versatility of quadrupeds. These systems have thrived at walking up stairs, traversing small obstacles, and navigating uneven terrain. Ladders, however, still present a big issue — especially given how ever present they are in factories and other industrial environments where the systems are deployed.
Braintrust raises $36M Series A.	Braintrust, which helps Airtable, Brex, Notion, and Stripe build AI products, has raised $36M in a Series A led by a16z.
Clout Kitchen raises $4.45M for AI gaming pal that mimics content creators.	Clout Kitchen announced today that it has raised $4.45 million in its seed funding round, which it plans to put towards its new creator-powered products and experiences. The first of these is Backseat AI, an AI-powered buddy for League of Legends that the company created with Tyler “Tyler1” Steinkamp — an AI buddy that can take on the aspect of popular gaming content creators. Clout Kitchen plans to use its funding to expand its team and build out its shared internal tech stack.
AlphaFold wins Nobel Prize in Chemistry.	Demis Hassabis, John Jumper, and David Baker were awarded the Nobel Prize in Chemistry for their groundbreaking work in protein folding, particularly through innovations like AlphaFold. Their contributions have significantly advanced the understanding of protein structures and their implications for science and medicine.
OpenAI reducing dependency on Microsoft data centers.	OpenAI is decreasing its reliance on Microsoft's data centers by acquiring its own compute infrastructure, allowing greater independence in its operations. Simultaneously, Microsoft is reducing its dependence on OpenAI as it develops and competes with its own AI products, signaling a shift in the dynamics of their partnership.
TikTok parent company ByteDance has a tool that's scraping the web 25 times faster than OpenAI.	TikTok parent company ByteDance is amassing huge volumes of web data way faster than the other major web crawlers. ByteDance may be planning to release its own LLM, and is aggressively using its web crawler, "Bytespider," to scrape up data to train its models, Fortune reported.
Sonair takes a cue from dolphins to build autonomous 3D vision without lidar.	Ultrasound is perhaps best known as the technology that enables noninvasive body scans and underwater communication and can help us park our cars. A young startup called Sonair out of Norway wants to employ it for something else: 3D computer vision used in autonomous hardware applications.
Tesla’s head of vehicle programs jumps to Waymo ahead of robotaxi reveal.	Tesla has lost a top executive to Waymo in the lead-up to the EV maker’s robotaxi unveiling on Thursday.
Autism ABA Therapy with Llama.	Meta shares a use case of its Llama model for medical and therapeutic benefit.
Uber’s EV ridehailing business is maturing.	The company also announced it was adding ChatGPT to its driver app to handle EV questions.
Amazon’s new AI guides can help shoppers find what they need.	The new AI Shopping Guides feature aims to help users find what they need with more informed product suggestions.
TikTok joins the AI-driven advertising pack to compete with Meta for ad dollars.	TikTok's Smart+ is an AI-powered ad-buying tool designed to automate and optimize ad campaigns, giving marketers the option to selectively utilize its features for enhanced performance. The tool seeks to rival Meta's Advantage+ by offering streamlined ad management and improved return on investment (ROI). Early results indicate significant gains in ad spend efficiency and conversion rates, positioning TikTok as a strong contender in the digital advertising market.
OpenAI partners with Cosmopolitan and Elle publisher Hearst.	ChatGPT will provide citations and direct links to the company's content.
Meta debuts new generative AI tools for creating video-based ads.	Meta Platforms Inc. today said it’s rolling out a full-screen video tab on Facebook in recognition of the fact that its users spend more time watching videos than anything else on its platforms.

Resources

Link	description
Introducing the Open FinLLM Leaderboard.	The Open FinLLM Leaderboard provides a dedicated evaluation platform designed specifically for financial language models. It emphasizes key financial tasks like predicting stock movements, analyzing sentiment, and extracting information from financial reports.
Infinite-Fractal-Stream: Small Scale Proxy for Scaling-Centric ML.	Model testing in the image domain is often constrained by low-quality, small datasets like CIFAR10. This GitHub repository provides a tool that generates infinite, complex fractals in the form of images or videos, offering a new approach for testing models.
Auto Jobs Applier.	A highly viral repository leverages language models to automate the job application process, adding an extra layer of personalization to tailor applications for each position.
Real-World Benchmarks Make Membership Inference Attacks Fail on Diffusion Models.	This study uncovers major weaknesses in existing membership inference attacks (MIAs) used to detect unauthorized data usage in diffusion models. It introduces CopyMark, a more realistic benchmark for assessing MIAs on pre-trained models, providing unbiased datasets and fair evaluation techniques to improve the accuracy and reliability of these attacks.
ImageFolder: Autoregressive Image Generation with Folded Tokens.	ImageFolder is a semantic tokenizer developed to balance the trade-off between image reconstruction accuracy and generation quality in visual generative models, improving the overall performance of these models in both tasks.
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models.	Grounded-VideoLLM is a novel Video-Large Language Model (Video-LLM) created to enhance the fine-grained understanding of specific moments in videos. By incorporating a temporal stream and discrete temporal tokens, the model more effectively captures the relationships between frames and timestamps, improving its ability to interpret and analyze detailed video content.
Autoregressive Action Sequence Learning for Robotic Manipulation.	The Chunking Causal Transformer (CCT) is a new autoregressive architecture developed specifically for robotic manipulation tasks. It is designed to improve the model's ability to process sequential data efficiently, optimizing performance in real-time robotic control and manipulation scenarios.
FacePoke.	FacePoke is a tool designed for rapid editing of faces in both videos and images, allowing users to make quick adjustments and modifications with ease.
pipeline_parallel.py.	A large model training lead at Hugging Face has shared an excellent 200-line example of parallelism built from scratch, demonstrating efficient techniques for distributing computational tasks, which is particularly useful for large-scale model training.
CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs.	As language models become increasingly proficient at writing code, many existing benchmarks are approaching saturation. This paper proposes a more challenging benchmark designed to assess how well models perform on reasoning and code generation tasks, pushing beyond basic code-writing capabilities to evaluate deeper problem-solving skills.
Intensify.	Intensify is a Python package that allows you to colorize text based on intensity values. It provides an easy-to-use interface for applying color gradients to text or background colors in the terminal.
Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality.	JEDi is a new metric built on the Joint Embedding Predictive Architecture (JEPA), designed to enhance evaluation accuracy with fewer samples. It better aligns with human assessments, making it a more robust alternative to the FVD (Fréchet Video Distance) metric for evaluating generative models.
PRFusion: Toward Effective and Robust Multi-Modal Place Recognition with Image and Point Cloud Fusion.	PRFusion and PRFusion++ are multimodal models developed to enhance place recognition in robotics and computer vision. By combining information from multiple sensory inputs, these models improve the accuracy and robustness of place recognition tasks, making them more effective in real-world applications.
Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia.	This paper presents ProLIP, a novel method for adapting vision-language models such as CLIP without adding additional parameters. ProLIP fine-tunes only the final projection matrix of the vision encoder, enabling it to deliver strong performance in few-shot classification tasks while maintaining the model's efficiency.
ScienceAgentBench.	The benchmark code for the science agent test is designed to evaluate how effectively models can contribute to novel scientific discoveries. It provides a framework for assessing a model's ability to generate innovative ideas, solve complex scientific problems, and make meaningful advances in various scientific fields.
Controlled Visual Generation.	Controllable AutoRegressive Modeling (CAR) is a novel framework that introduces precise control mechanisms to pre-trained visual autoregressive models. This method enables more refined and targeted image generation by progressively improving control representations, allowing for fine-tuned outputs with reduced computational resources.
PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners.	PredFormer is a newly developed transformer-based method for spatiotemporal predictive learning, offering superior performance in both accuracy and efficiency compared to existing approaches. It excels in tasks that involve predicting changes over time and space, making it a powerful tool for various applications in fields like video analysis, weather forecasting, and robotics.
GenSim2: Scaling Robotic Data Generation with Multi-modal and Reasoning LLMs.	This paper presents an innovative approach to scaling robotic data collection by utilizing an enhanced, high-quality physics simulation dataset. The improved simulation environment enables more efficient data generation for training robots, offering a scalable and cost-effective method to collect large amounts of accurate and diverse data for robotic learning and development.
Learning Efficient and Effective Trajectories for Differential Equation-based Image Restoration.	This project introduces a novel differential equation-based approach for image restoration. By leveraging mathematical models grounded in differential equations, the method enhances the ability to recover and restore degraded or noisy images, providing improved accuracy and performance in image restoration tasks.
Pixtral 12B.	The Mistral team has provided detailed insights into the training process and architecture of their vision-language model, which has demonstrated solid performance. The model incorporates advanced techniques for effectively integrating visual and linguistic data, allowing it to perform well on a variety of tasks that require understanding both images and text. The shared information includes specifics on data preprocessing, model architecture, and the optimization strategies employed during training.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering.	MLE-bench is a benchmark created to evaluate AI agents' capabilities in machine learning engineering. It includes a curated selection of 75 Kaggle competitions to test various skills, such as model training, dataset preparation, and optimization. The benchmark aims to assess how well AI agents can handle practical machine learning tasks, providing a comprehensive evaluation of their engineering proficiency.
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate.	The Modality Integration Rate (MIR) is a new metric designed to evaluate the effectiveness of multi-modal pre-training in Large Vision Language Models. It measures how well different modalities, such as visual and textual data, are integrated during the pre-training process, offering insights into the model's ability to leverage information from both sources to improve performance on multi-modal tasks.
Aria: First Open Multimodal Native MoE Model.	A highly impressive new vision-language model has been released with open weights, code, and a comprehensive research report. It achieves performance on par with closed models for long video understanding, a challenge that has proven difficult for other open models like Pixtral and Molmo. This advancement represents a significant breakthrough in the field of open-source vision-language models.
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation.	IterComp is a new framework developed to enhance compositional text-to-image generation by integrating the strengths of multiple advanced diffusion models, including RPG, Stable Diffusion 3, and FLUX. By leveraging these models, IterComp improves the quality and coherence of generated images, especially when handling complex textual prompts that require multiple elements to be composed accurately.
MatMamba.	MatMamba is a novel architecture for sequence processing, building upon the Mamba2 framework by incorporating a Matryoshka-like design. This approach allows a single model to be trained at multiple granularities, enabling the extraction of various smaller, nested submodels. This hierarchical structure enhances flexibility and efficiency, allowing the model to adapt to different levels of complexity and resource constraints.
O1 replication progress report.	Researchers from GAIR and NYU have been investigating the critical algorithmic advancements behind OpenAI's o1 model's exceptional performance. In their report, they introduce the concept of "Journey Learning" data, a novel approach that, when used in training, boosts math performance by 8% in absolute terms. This innovation highlights how specific data types can significantly enhance a model's reasoning and problem-solving abilities.

Perspectives

Link	description
Nuclear power for AI: what it will take to reopen Three Mile Island safely.	As Microsoft strikes a deal to restart a reactor at the notorious power station, Nature talks to nuclear specialists about the unprecedented process.
‘In awe’: scientists impressed by latest ChatGPT model o1.	The chatbot excels at science, beating PhD scholars on a hard science test. But it might ‘hallucinate’ more than its predecessors.
Can AI have common sense? Finding out will be key to achieving machine intelligence.	The advent of LLMs has reopened a debate about the limits of machine intelligence — and requires new benchmarks of what reasoning consists of.
How your brain detects patterns in the everyday: without conscious thought.	Neurons in certain brain areas integrate ‘what’ and ‘when’ information to discern hidden order in events in real time.
AI to the rescue: how to enhance disaster early warnings with tech tools.	Artificial intelligence can help to reduce the impacts of natural hazards, but robust international standards are needed to ensure best practice.
Before Mira Murati’s surprise exit from OpenAI, staff grumbled its o1 model had been released prematurely.	OpenAI's accelerated development and safety testing of its latest models, such as GPT-4o and o1, have led to internal friction, resulting in the departure of several senior staff members. The rapid pace of development has raised concerns about the thoroughness of the safety protocols, contributing to tensions within the organization.
I Quit Teaching Because of ChatGPT.	This professor resigned from teaching due to the widespread use of large language models (LLMs) like ChatGPT among students, which they felt undermined academic integrity and the traditional learning process.
Three Subtle Examples of Data Leakage.	This article examines the risks of data leakage in machine learning, showcasing two real-world cases where improper data handling resulted in misleading model performance. In one instance, a company incorrectly filtered data by an upper price limit before modeling, while another organization encountered problems by not following a strict chronological split. The key lessons emphasize the critical need for detecting data leakage and understanding its detrimental effects on model accuracy and reliability.
The real data wall is billions of years of evolution.	AI development is encountering a potential obstacle known as the "data wall," as language models near the limit of available textual data for training. This article challenges the idea of using human analogies to overcome these data constraints, pointing out that human intelligence results from vast amounts of data and long evolutionary processes, which differ fundamentally from AI. While human learning strategies may not directly translate to AI, this doesn't preclude progress through other modalities, such as multimodal data, or advancements in algorithms that could push AI capabilities further.
AI will use a lot of energy. That's good for the climate.	AI data centers are significantly increasing the demand for clean, 24/7 energy, prompting tech giants to invest heavily in renewable and nuclear power solutions. This growing demand is expected to accelerate the cost reduction of clean energy technologies, driven by their learning rates. Over time, the energy needs of AI could lead to policy shifts and advancements in clean energy infrastructure, fostering faster adoption and development of sustainable energy sources.
I want to break some laws too.	This article explores the use of an automated data cleaning pipeline inspired by the Minipile method, which prunes datasets to deliver significant performance gains with only a fraction of the original data size. By leveraging techniques such as few-shot prompting and clustering, the approach streamlines dataset refinement for AI training, challenging traditional scaling laws by prioritizing data quality over quantity. The results indicate that using foundational datasets with more refined data can optimize AI model training, reducing resource consumption while boosting performance.

Back to index

ML news: Week 30 September - 6 October

Research

Link	description
PGN: The RNN's New Successor is Effective for Long-Range Time Series Forecasting.	The Parallel Gated Network (PGN) is an innovative architecture developed to address the challenges that traditional RNNs face in managing long-term dependencies. Shortening the information propagation path and incorporating gated mechanisms efficiently captures past and present time step data.
Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs.	DoSSR is a diffusion-based super-resolution model that improves both performance and efficiency by utilizing pre-trained diffusion models and initiating the process with low-resolution images. This approach accelerates the super-resolution process while maintaining high-quality results.
MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models.	MaskLLM is a pruning technique designed to decrease the computational load of large language models by introducing learnable sparsity. This method optimizes performance while maintaining model efficiency by selectively reducing the number of active parameters.
Law of the Weakest Link: Cross Capabilities of Large Language Models.	This project emphasizes the importance of evaluating large language models (LLMs) based on their combined abilities rather than focusing solely on individual skills. While most models are trained on specialized datasets that target specific capabilities, real-world tasks frequently demand a blend of expertise across different areas, known as cross-capabilities. This approach ensures that models are better suited to handle complex, multifaceted challenges.
Scaling Optimal LR Across Token Horizon.	This paper investigates how to adjust the learning rate as a model's training data increases. While LLaMA applied an exponential scaling factor of -0.28, the paper proposes using an exponential scaling factor of -0.33 for improved performance during training with larger datasets.
Knowledge Graph Embedding by Normalizing Flows.	This paper presents a novel approach to knowledge graph embedding by leveraging group theory to incorporate uncertainty into the process. This method allows for more nuanced and flexible representations of relationships within knowledge graphs, enhancing the model's ability to handle uncertain or ambiguous information.
How AI is improving simulations with smarter sampling techniques.	MIT CSAIL researchers created an AI-powered method for low-discrepancy sampling, which uniformly distributes data points to boost simulation accuracy.

News

Link	description
Apple not investing in OpenAI after all, new report says.	Apple is no longer planning to invest in OpenAI, according to a new report from The Wall Street Journal. This comes as OpenAI plans to close a $6.5 billion funding round next week, with investments possible from both Microsoft and Nvidia.
Arcade AI raises 17M to transform commerce.	Arcade AI, a generative product company that launched this week, has announced securing funding from prominent investors as it aims to develop its "prompt to product" system. This system enables the immediate creation of products that are ready for purchase, streamlining the process from concept to consumer.
They stole my voice with AI.	Elecrow is suspected of using AI to clone a voice for promotional videos without consent.
Amazon-backed Anthropic in talks to raise money at $40B valuation: report.	Anthropic, a generative AI startup backed by Amazon and other major tech companies, is in discussions to raise additional funding that could potentially value the company at $40 billion.
OpenAI Reportedly Slated for $500 Million SoftBank Investment.	SoftBank is planning to invest $500 million in OpenAI's latest funding round, which could raise OpenAI's valuation to as high as $150 billion. Microsoft is also participating in this round, highlighting OpenAI's rapid 1,700% revenue growth, despite the company anticipating losses of around $5 billion.
OpenAI Is Growing Fast and Burning Through Piles of Money.	As the company looks for more outside investors, documents reviewed by The New York Times show consumer fascination with ChatGPT and a serious need for more cash.
Altman reportedly asks Biden to back a slew of multi-gigawatt-scale AI datacenters.	OpenAI CEO Sam Altman is calling on the Biden administration to establish AI data centers in the US that could consume up to five gigawatts of power, aiming to maintain US technological leadership over China. The proposal includes building several large-scale data centers across the country. Meanwhile, other tech giants, such as Microsoft and Amazon, are securing nuclear power deals to support their growing AI operations.
Samsung's Galaxy Tab S10 Ultra and Galaxy Tab S10+ are tablets built for AI.	Samsung is once again expanding its tablet lineup, and this time, the company is doing so with AI at the forefront. Today, Samsung revealed the Galaxy Tab S10 series, two models that it says are "built with AI enhancements available right out of the box."
Tesla Full Self Driving requires human intervention every 13 miles.	It gave pedestrians room but ran red lights and crossed into oncoming traffic.
OpenAI Dev Day 2024.	OpenAI's Dev Day 2024 featured several exciting announcements, including the introduction of vision model fine-tuning, a real-time API, prompt caching for faster responses, and model distillation for more efficient deployment of large models. These advancements aim to enhance the capabilities and performance of AI applications across various domains.
Pika 1.5.	Pika has released version 1.5 with more realistic movement, big screen shots, and Pikaffects.
Gov. Newsom vetoes California’s controversial AI bill, SB 1047.	Governor Gavin Newsom has vetoed SB 1047, a proposed bill intended to regulate AI development and enforce safety protocols for high-cost models. Newsom expressed concerns that the bill's broad application to all large, computation-heavy models was not the most effective method for regulating AI. However, he reaffirmed his commitment to AI safety by signing several other AI-related bills and consulting with experts to ensure thoughtful regulation in the future.
OpenAI to remove non-profit control and give Sam Altman equity, sources say.	hatGPT-maker OpenAI is working on a plan to restructure its core business into a for-profit benefit corporation that will no longer be controlled by its non-profit board, people familiar with the matter told Reuters, in a move that will make the company more attractive to investors.
OpenAI's latest funding .	OpenAI has secured $6.6 billion in new funding, bringing its post-money valuation to $157 billion. Notable investors in this round include Microsoft and Nvidia, with the funds aimed at further scaling AI development and innovation.
Google adds a multi-functional quick insert key and new AI features to Chromebook Plus.	Google is announcing new Chromebook models today with Samsung and Lenovo. With Samsung’s Galaxy Chromebook Plus model in particular, the company is also introducing a new multifunctional quick insert key. But Google doesn’t want to leave existing Chromebook users behind as it added new AI-powered features for existing devices.
Brain-like Computers Tackle the Extreme Edge.	Start-up BrainChip announces a new chip design for a milliwatt-level AI inference
AI Can Best Google’s Bot Detection System, Swiss Researchers Find.	Researchers from ETH Zurich used advanced machine learning to solve 100% of Google's reCAPTCHAv2, designed to distinguish humans from bots.
OpenAI Training Data to Be Inspected in Authors’ Copyright Cases.	At a secure room in its San Francisco office, representatives for authors suing OpenAI will examine materials that were used to train its AI system. They allege copyrighted works were utilized without their consent or compensation.
ByteDance will reportedly use Huawei chips to train a new AI model.	US export restrictions are preventing ByteDance from using NVIDIA chips.
Announcing FLUX1.1 [pro] and the BFL API.	FLUX1.1 [pro] has been released, offering six times faster generation speeds compared to its predecessor, alongside enhanced image quality and overall performance. The new beta BFL API introduces advanced customization options and competitive pricing, making it easier for developers to integrate FLUX’s capabilities. FLUX1.1 [pro] will be available across multiple platforms, providing greater scalability and efficiency for users and developers alike.
OpenAI launches new ‘Canvas’ ChatGPT interface tailored to writing and coding projects.	OpenAI introduced a new way to interact with ChatGPT on Thursday: an interface it calls “canvas.” The product opens a separate window, beside the normal chat window, with a workspace for writing and coding projects. Users can generate writing or code directly in the canvas, and then highlight sections of the work to have the model edit. Canvas is rolling out in beta to ChatGPT Plus and Teams users on Thursday, and Enterprise and Edu users next week.
Anthropic hires OpenAI co-founder Durk Kingma.	Durk Kingma, one of the lesser-known co-founders of OpenAI, today announced that he’ll be joining Anthropic.
OpenAI unveils easy voice assistant creation at 2024 developer event.	Altman steps back from the keynote limelight and lets four major API additions do the talking.

Resources

Link	description
🚀 FlowTurbo.	FlowTurbo is a method developed to accelerate the sampling process in flow-based models while maintaining high-quality outputs. It achieves faster results without compromising the precision or performance of the model.
Transformer4SED.	This repository presents the Prototype-based Masked Audio Model, which enhances sound event detection by leveraging unlabeled data more effectively. The method generates pseudo labels through a Gaussian mixture model, which directs the training of a Transformer-based audio model for improved performance.
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models.	Vector Post-Training Quantization is a technique aimed at enabling ultra-low-bit quantization for large language models, optimizing memory and storage efficiency during deployment without significantly compromising performance.
LightAvatar: Efficient Head Avatar as Dynamic NeLF.	LightAvatar is a head avatar model that improves rendering speed and efficiency using neural light fields (NeLFs).
Separating code reasoning and editing.	Aider has significantly enhanced the performance of general-purpose code editing by employing o1 as the architect and DeepSeek as the writer. This collaboration streamlines the process, leading to more efficient and accurate code generation.
Heralax/Mistrilitary-7b.	This model was trained using army handbooks and incorporates deep, specialized knowledge that is uncommon in fine-tuned models. This unique training approach allows it to possess a rare level of expertise in military-related tasks and information.
Developing a go bot embedding ichiban Prolog.	Ichiban Prolog was integrated into Hellabot, a Go-based IRC bot, to eliminate the need for recompiling when adding new triggers. This integration enables dynamic Prolog code execution, allowing users to adjust the bot's logic in real-time. Future enhancements could focus on minimizing interpreter setup overhead and shifting more of the bot's logic into Prolog for greater flexibility and efficiency.
Emu 3 open early fusion multimodal model.	Emu 3 is a next-token prediction model that surpasses SDXL in image synthesis, LlaVa-1.6 in image understanding, and OpenSora 2 in video generation. With 9 billion parameters, Emu 3 is trained on these tasks in an interleaved manner, similar to Gemini, making it highly versatile and effective across multiple domains.
LOTUS: Diffusion-based Visual Foundation Model for High-quality Dense Prediction.	Using pre-trained diffusion models for tasks like depth estimation has become highly popular and effective. This work demonstrates how certain previous methods contained minor inaccuracies and presents improvements that not only boost performance but also significantly simplify the overall modeling process.
Revisit Anything: Visual Place Recognition via Image Segment Retrieval.	SegVLAD is a method for visual place recognition that emphasizes the analysis of image segments instead of relying on entire images. This approach enhances recognition accuracy by focusing on distinctive parts of the scene, making it more robust in various environments.
LeanRL - Turbo-implementations of CleanRL scripts.	LeanRL is a lightweight library consisting of single-file, pytorch-based implementations of popular Reinforcement Learning (RL) algorithms. The primary goal of this library is to inform the RL PyTorch user base of optimization tricks to cut training time by half or more.
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding.	E.T. Bench is a newly developed benchmark created to assess the performance of video language models on fine-grained, event-level tasks. Unlike earlier benchmarks that emphasize video-level questions, E.T. Bench spans a variety of time-sensitive tasks across multiple domains, providing a more detailed evaluation of model capabilities.
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning.	Apple is continuing to strengthen its in-house AI capabilities by developing a robust multimodal foundation model. This initiative is part of Apple's broader efforts to integrate advanced AI technologies across its ecosystem, supporting tasks that span text, image, and other data modalities for enhanced user experiences.
The Perfect Blend: Redefining RLHF with Mixture of Judges.	Meta has introduced an impressive new paper detailing the use of a mixture of judges models to effectively conduct multi-task reinforcement learning with human feedback (RLHF) during post-training. This approach significantly enhances the final performance of models across various benchmarks, demonstrating superior results compared to previous methods.
A Survey on the Honesty of Large Language Models.	This survey explores the honesty of large language models (LLMs), a crucial aspect in aligning AI with human values. It addresses challenges such as models confidently providing incorrect answers and the difficulty in distinguishing between what the model knows and what it doesn't. The review highlights these obstacles as key areas for improving the reliability and trustworthiness of LLMs.
LexEval: A Comprehensive Benchmark for Evaluating Large Language Models in Legal Domain.	LexEval is a benchmark created to evaluate large language models (LLMs) specifically in the legal domain. Recognizing the critical need for accuracy, reliability, and fairness in legal applications, LexEval provides a framework for assessing the strengths and limitations of LLMs when applied to legal tasks, ensuring they meet the rigorous demands of the field.
Perceptual Compression (PerCo).	PerCo (SD) is a novel perceptual image compression technique built on Stable Diffusion v2.1, specifically designed for ultra-low bit ranges. This method leverages the power of diffusion models to achieve high-quality image compression at significantly reduced bitrates, optimizing storage and transmission without sacrificing visual fidelity.
nvidia/NVLM-D-72B.	Nvidia conducted a thorough ablation study on various methods of incorporating images into a language model. The results showed that the LlaVa concatenation approach outperformed the other methods, proving to be the most effective for integrating visual information into language models.
ProFD: Prompt-Guided Feature Disentangling for Occluded Person Re-Identification.	This paper introduces a new method called Prompt-guided Feature Disentangling (ProFD) to tackle occlusion challenges in person Re-Identification (ReID) tasks. ProFD helps separate relevant features from occluded or irrelevant ones, improving the accuracy and robustness of ReID models when identifying individuals in complex or obstructed environments.
Local File Organizer: AI File Management Run Entirely on Your Device, Privacy Assured.	This tool utilizes Llama 3.2 3B and Llava-1.6 to intelligently organize files on your computer into logical sections based on their content. By analyzing the data within the files, it categorizes and arranges them for easier navigation and more efficient file management.
Posterior-Mean Rectified Flow:Towards Minimum MSE Photo-Realistic Image Restoration.	Posterior-Mean Rectified Flow (PMRF) is a cutting-edge algorithm designed for photo-realistic image restoration. It improves the quality of restored images by refining the flow of information, resulting in highly accurate and visually appealing reconstructions.
RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models.	RouterDC is an innovative method designed to enhance collaboration between multiple large language models (LLMs) through query-based routing. It utilizes contrastive learning to determine the most suitable model for each query, leading to improved performance compared to existing routing techniques. This approach optimizes model selection, ensuring more accurate and efficient responses.
Distributed Training of Deep Learning models .	This post provides an excellent introduction to the challenges and algorithms involved in distributed training for modern deep learning models. It explores the difficulties and bottlenecks of training models that are too large for a single GPU, including issues like communication overhead, synchronization, and memory limitations, while also discussing key techniques to overcome these obstacles.
ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation.	Instead of directly generating an image from a prompt, the authors created a workflow using a comfy UI node-based system to guide the image generation process. This approach significantly enhanced the final output quality, allowing for greater control and precision in the generation pipeline.
KnobGen.	KnobGen is a new framework developed to make sketch-based image generation more accessible to users of varying skill levels. By offering intuitive controls and simplified tools, KnobGen allows users to generate high-quality images from sketches, regardless of their artistic expertise.
Tiny Test Models.	AI researcher Ross Wightman has released a collection of models trained on ImageNet-1k that are remarkably small, with fewer than 1 million parameters. Despite their compact size, these models perform reasonably well and are designed to be easy to fine-tune, making them highly accessible for various applications where model efficiency is critical.
entropix.	Entropy-based sampling and parallel Chain of Thought (CoT) decoding are promising strategies for advancing reasoning models to match
Concordia.	DeepMind's Concordia repository enables the simulation of social interactions between individuals and groups at a reasonable scale. This platform allows researchers to model complex social behaviors, study group dynamics, and explore various interaction scenarios in a controlled, scalable environment.

Perspectives

Link	description
The Intelligence Age.	AI is set to enhance human abilities, empowering us to accomplish tasks that are currently beyond imagination. With the help of deep learning and more powerful computational tools, AI will drive innovations such as personalized assistants, learning tutors, and healthcare advisors. The emphasis should be on ensuring AI is widely accessible while addressing its potential risks, creating a path toward shared prosperity in the era of intelligent systems.
How AlphaChip transformed computer chip design.	AlphaChip is a reinforcement learning model that dramatically speeds up and improves chip design, creating layouts that surpass human capabilities. It produces optimized chip designs, such as those used in Google's TPUs, in just hours instead of weeks. This AI-powered approach has wide-ranging applications, benefiting not only Google's hardware but also external companies like MediaTek.
AI pareidolia: Can machines spot faces in inanimate objects?	New dataset of “illusory” faces reveals differences between human and algorithmic face detection, links to animal face recognition, and a formula predicting where people most often perceive faces.
Table Extraction using LLMs: Unlocking Structured Data from Documents.	This article discusses how large language models (LLMs) are transforming table extraction from complex documents, surpassing the limitations of traditional methods such as OCR, rule-based systems, and machine learning. LLMs offer greater flexibility and contextual comprehension, significantly improving accuracy in handling varied and intricate table structures. While challenges like hallucination and high computational demands remain, the integration of traditional techniques with LLMs currently provides the most effective solution for automated table extraction.
The Other Bubble.	Microsoft considered diverting its US-based server power to GPUs for AI purposes but ultimately abandoned the idea. Major tech companies like Microsoft, Google, and Amazon are making significant investments in AI, yet they continue to see underwhelming returns from generative AI applications. The industry's reliance on SaaS and the integration of AI tools, which frequently offer limited practical value while incurring substantial costs, underscores an increasing urgency to sustain growth in a slowing market.
AI's Privilege Expansion.	AI is quickly broadening access to services that were once expensive and difficult to obtain, such as education, healthcare, and personal styling. Generative AI models like ChatGPT offer affordable, personalized support by acting as tutors, healthcare advisors, and stylists, reducing the need for costly human professionals. This transformation democratizes access to high-end services, making them more widely available to the general public at a significantly lower cost.
Behind OpenAI’s Audacious Plan to Make A.I. Flow Like Electricity.	OpenAI CEO Sam Altman has proposed a global initiative to construct data centers and chip factories to drive advanced AI development. While Altman initially aimed for trillions in funding, he has now scaled back to targeting hundreds of billions. The plan envisions partnerships with global tech giants and governments, though it faces significant regulatory and logistical hurdles. Despite early skepticism, ongoing discussions suggest potential expansions across the US, Europe, and Asia to significantly increase computing power for AI advancements.
Devs gaining little (if anything) from AI coding assistants.	Code analysis firm sees no major benefits from AI dev tool when measuring key programming metrics, though others report incremental gains from coding copilots with emphasis on code review.
Negligence Liability for AI Developers.	This article advocates for a negligence-based approach to AI accountability, emphasizing the human factors and responsibilities behind AI systems. It critiques existing regulatory frameworks for neglecting the role of AI developers and highlights California's AI safety bill as a promising example. The article also delves into the complexities of defining "reasonable care" in AI development and the potential consequences of classifying AI developers as professionals, raising important questions about the standards and obligations they should meet.
I am tired of AI.	The author expresses frustration with the widespread marketing and overuse of AI, especially in fields like software testing and conference proposals. They argue that AI tools often prioritize speed at the expense of quality and fail to offer the unique insights that come from human-generated work. While acknowledging some useful applications of AI, the author criticizes the increasing amount of mediocre AI-produced content, seeing it as a detriment to innovation and depth in these areas.
The Four Short Term Winners of AI.	The global AI arms race is primarily driven by Big Tech companies, chipmakers such as NVIDIA, intellectual property lawyers, and the Big 4 consulting firms. These key players are competing to secure technological dominance, resources, and expertise in AI development, shaping the future of the industry through their influence and innovations.
The Art of the OpenAI Deal.	OpenAI's revenue soared to $300 million in August, with the company forecasting $3.7 billion in annual sales for this year and $11.6 billion for next year. However, it is facing a $5 billion annual loss. This rapid growth has been driven primarily by the widespread success of ChatGPT, which generates the majority of its revenue. Despite this momentum, OpenAI is actively seeking additional investors to cover its high operational costs and work towards becoming a profitable enterprise.
What comes after?	California Governor Gavin Newsom has vetoed SB 1047, a bill aimed at regulating large AI models. He stressed the importance of creating evidence-based regulations and cautioned that overly restrictive rules could hinder innovation. Instead, Newsom plans to collaborate with experts, including Dr. Fei-Fei Li, to develop empirical, science-driven guidelines that balance safety and progress in AI development.
Sorry, GenAI is NOT going to 10x computer programming.	Recent studies indicate that generative AI has not yet delivered the expected 10x improvement in coding productivity. While AI tools can assist with code generation and streamline certain tasks, the overall productivity gains have been more modest than initially projected, with challenges such as integration, context understanding, and debugging limiting the full potential of these technologies in real-world coding environments.

Back to index

ML news: Week 23 - 29 September

Research

Link	description
Moshi: a speech-text foundation model for real-time dialogue.	presents a full-duplex spoken dialogue framework and a speech-text basis paradigm; they also present several system components; Helium is a 7B parameter text LLM; Mimi is a semantic-acoustic neural audio code that achieves cutting-edge audio quality performance; and a hierarchical multi-stream architecture that can produce speech-to-speech from any given dialog.
Training Language Models to Self-Correct via Reinforcement Learning.	creates a multi-turn online reinforcement learning system that is fully based on self-generated data in order to enhance an LLM's ability to self-correct; It is demonstrated that SFT has a distribution mismatch between training data and model responses and is inefficient at learning self-correction; suggests a two-stage method that, when applied to the Gemini 1.0 Pro and 1.5 Flash models, achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1%, respectively, on the MATH and HumanEval benchmarks. The first stage of the method optimizes correction behavior, and the second uses a reward bonus to amplify self-correction during training.
On the Diagram of Thought.	strengthens LLMs' capacity for reasoning through rigorous mathematics; DAT represents iterative reasoning in LLM as the building of a directed acyclic graph; it combines propositions, criticisms, refinement, and verification into a single DAG structure; this enables DoT to capture sophisticated logical deduction that is beyond the scope of linear or tree-based methods
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning.	examines which tasks benefit most from chain-of-thought (CoT) prompting; following a meta-analysis of over 100 papers and multiple evaluations, it concludes that CoT leads to significant performance gains, mostly on math and logic tasks; the majority of the CoT gain is derived from improving symbolic execution, although a symbolic solver performs better than it.
A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B.	examines how instruction-tuned LLMs perform on models ranging from 7B to 405B using different quantization techniques. The main conclusions are that: 1) one should quantize a larger LLM to a similar size because a smaller FP16 LLM typically performs better across most benchmarks; 2) performance varies significantly with different quantization techniques, model size, and bit-width, with weight-only methods frequently producing better results in larger models; and 3) task difficulty does not significantly impact accuracy degradation due to quantization.
Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning.	uses an inner dialogue agent to act as a guide to dynamically adjust reasoning paths, allowing adaptive cross-path exploration and improving response accuracy. This makes it different from CoT and ToT, which are both rigid processes, in that its prompt generation is a dynamic process that allows it to adapt. suggests the Iteration of Thought (IoT) framework to improve the LLM responses and reasoning capabilities with adaptive reasoning paths.
Schrodinger's Memory: Large Language Models.	utilizes the Universal Approximation Theorem to describe how LLMs store memory. Additionally, it suggests a novel method for assessing LLM performance by contrasting the memory capacities of various models; the Transformer architecture serves as a dynamic fitting UAT model with a high degree of adaptability in fitting inputs, allowing LLMs to recall the entirety of the content with the least amount of input data.
Jailbreaking Large Language Models with Symbolic Mathematics.	generates mathematically encoded prompts using GPT-4o, which is a useful jailbreaking strategy; the average attack success rate over 13 state-of-the-art is 73.6%. This indicates that current safety training systems are not able to generalize to mathematically encoded inputs.
Iterative Object Count Optimization for Text-to-image Diffusion Models.	Generating a specific number of objects with a diffusion model is often a difficult task. This work introduces a counting token that enables the model to more accurately produce either a few or many instances of a given object. While it's not flawless and is based on the original stable diffusion model, it significantly outperforms existing methods.
A Controlled Study on Long Context Extension and Generalization in LLMs.	Researchers have created a standardized evaluation protocol designed to compare different methods for extending language models to effectively handle long document contexts.
MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning.	MAgICoRe is a novel strategy designed to enhance reasoning in large language models by tackling challenges in refinement processes. It classifies problems based on difficulty, applying straightforward strategies to simpler tasks and employing multi-agent iterative refinement for more complex ones.
The Impact of Element Ordering on LM Agent Performance.	The sequence in which UI elements are displayed greatly affects agent performance in virtual environments. Randomizing the order of elements can decrease performance as much as completely removing all visible text.
Larger and more instructable language models become less reliable.	Scaling up and shaping up large language models increased their tendency to provide sensible yet incorrect answers at difficulty levels humans cannot supervise, highlighting the need for a fundamental shift in artificial intelligence design towards reliability.
SwiftDossier: Tailored Automatic Dossier for Drug Discovery with LLMs and Agents.	This work addresses the limitations of LLMs in drug discovery by integrating an advanced Retrieval-Augmented Generation (RAG) system for more accurate answers and combining LLMs with external tools to create an automatic target dossier. The result is a production-ready dossier with comprehensive data, summarized into a PDF and PowerPoint presentation.
Self-Explainable AI.	In the field of explainable AI, there is a strong focus on developing self-explainable models, which offer a more principled approach compared to post-hoc methods that attempt to interpret decisions after they have been made by opaque models. Despite its potential, this line of research often faces challenges such as lack of reproducibility, difficulties in comparison, and inconsistent standards. To address these issues, we introduce CaBRNet, an open-source, modular, and backward-compatible framework for Case-Based Reasoning Networks

News

Link	description
Google CEO Sundar Pichai announces $120M fund for global AI education.	Speaking Saturday at the UN Summit of the Future, Google CEO Sundar Pichai described AI as “the most transformative technology yet” and announced a new fund for AI education and training around the world.
Driver Distractions ‘Exceedingly High’ When Using Partial Automation Systems: IIHS.	According to the IIHS, once advanced driver-assistance systems come into play, drivers become less involved in driving and more distracted. Hands-on or hands-free, the level of automation doesn’t matter.
wordfreq will not be updated.	The wordfreq data is a snapshot of language that could be found in various online sources up through 2021. Generative AI has polluted the data
Drones carrying fireworks: why the world’s most famous gunpowder artist is collaborating with AI.	For his explosion event in Los Angeles, Cai Guo-Qiang built his own version of ChatGPT and employed a drone army to answer the question: what is the fate of humanity and AI?
AI could lead to inconsistent outcomes in home surveillance.	Researchers find large language models make inconsistent decisions about whether to call the police when analyzing surveillance videos.
Arcade Announces First-Ever AI Product Creation Platform.	Arcade is a new platform where users can go from prompt to product.
Salesforce Taps Nvidia to Develop AI-Powered Avatars.	Salesforce and Nvidia are partnering to develop advanced artificial intelligence capabilities aimed at delivering new insights and enhancing productivity for teams utilizing Salesforce's platform.
Introducing the OpenAI Academy.	OpenAI is launching a program aimed at expanding AI knowledge access in low and middle-income countries. Additionally, it has professionally translated the MMLU, a standard reasoning benchmark, into 15 different languages.
China’s Alibaba launches over 100 new open-source AI models, releases text-to-video generation tool.	Alibaba has introduced over 100 open-source AI models, bolstering its technology to stay competitive with its rivals. The latest Qwen 2.5 models, improved in areas like math and coding, cater to various applications, including automobiles and gaming. Additionally, Alibaba has unveiled a new proprietary model, Qwen-Max 2.5, along with a text-to-video tool to enhance its AI and cloud service offerings.
Apple Intelligence Features Expected to Roll Out in This Order Between iOS 18.1 and iOS 18.4.	Apple's iOS 18.1 will debut significant AI features, including an improved Siri, generative AI tools within Photos, and ChatGPT integration. In iOS 18.2, these capabilities will be expanded with localized support across various English-speaking countries, alongside the introduction of Image Playground and Genmoji. Upcoming updates, like iOS 18.4, will further personalize Siri and add support for additional languages.
Microsoft updates its AI suite with more agents and Copilots.	Microsoft is enhancing its generative AI suite by introducing automated agents, expanding the capabilities of its Copilot assistants, and launching a new tool that enables multiple workers to collaboratively engage with artificial intelligence.
Sam Altman leaves OpenAI board's safety and security committee.	OpenAI announced that CEO Sam Altman is stepping down from the board's safety and security committee, which will now consist entirely of independent board members.
Silicon Valley billionaire Vinod Khosla says AI will handle 80% of work in 80% of jobs.	Yet another Silicon Valley billionaire has just predicted that most jobs will be replaced by AI—whether you work on a farm or in sales.
Hollywood is coming out in force for California’s AI safety bill.	Hollywood is squaring off against Silicon Valley in the battle over SB 1047, California’s first-of-its-kind AI safety bill. Amid doubts about whether Governor Gavin Newsom will sign the legislation, a wave of star-studded endorsements mark the first organized celebrity effort to advance AI regulations beyond the direct interests of the entertainment industry.
OpenAI rolls out Advanced Voice Mode with more voices and a new look.	OpenAI announced it is rolling out Advanced Voice Mode (AVM) to an expanded set of ChatGPT’s paying customers on Tuesday. The audio feature, which makes ChatGPT more natural to speak with, will initially roll out to customers in ChatGPT’s Plus and Teams tiers. Enterprise and Edu customers will start receiving access next week.
OpenAI CEO Sam Altman declares we could have superintelligence 'in a few thousand days'.	OpenAI CEO Sam Altman has declared that humanity is on the brink of a superintelligence revolution, and that "In the next couple of decades, we will be able to do things that would have seemed like magic to our grandparents."
Google says generative AI is ready to do real work.	Google is holding a "Gemini at Work" event Tuesday to convince businesses that its generative AI is better than offerings from Microsoft and OpenAI. The largely virtual event comes amid a flurry of claims from tech providers and growing skepticism that genAI is ready for broad use beyond coding and customer support.
Google, Volkswagen partner on smartphone AI assistant.	Google is providing key capabilities for an artificial intelligence assistant for Volkswagen drivers in a smartphone app, part of Google's strategy to win business by offering tools to build enterprise AI applications.
Will AI replace programmers? Don't count on it, says Google's CEO.	the CEO of Google and its owner company, Alphabet, believes that AI won't be replacing programmers - instead, it'll actually help more people become coders than ever before.
Cloudflare's new AI Audit tool aims to give content creators better bot controls.	Don't want your work ripped off by OpenAI, Meta AI, and Google Gemini? If your work is on a website you control, Cloudflare's AI Audit tool may help. Here's how to try it.
James Cameron, Academy Award-Winning Filmmaker, Joins Stability AI Board of Directors.	Renowned filmmaker James Cameron has joined the board of generative media company Stability AI to help steer its shift toward visual storytelling.
Updated Gemini models, reduced 1.5 Pro pricing, increased rate limits.	Google's Gemini models have seen a significant cost reduction, an expanded context length of up to 2 million tokens, and overall performance enhancements. An intriguing detail is the noticeable jump in cost after reaching 128k tokens.
Llama 3.2: multimodal.	Meta has introduced a new series of Llama models with vision capabilities, including versions with 1 billion and 3 billion parameters, as well as several additional multimodal models.
OpenAI CTO Mira Murati is leaving.	wo other company leaders are also out in what CEO Sam Altman calls an “abrupt” reorganization.
OpenAI staffers reportedly 'taken aback' by 'ominous' logo rebranding.	OpenAI is set to rebrand in 2024 with a new logo that employees felt lacked creativity. Alongside this change, the company is transitioning from a non-profit to a for-profit model. The rebranding effort is intended to strengthen its identity as OpenAI gains greater recognition.
Accelerating particle size distribution estimation.	MIT researchers have accelerated a new AI-based estimator for medication manufacturing, achieving a 60-fold increase in speed.
Apple Intelligence will support German, Italian, Korean, Portuguese, and Vietnamese in 2025.	Apple announced Wednesday that its generative AI offering will be available in even more languages in 2025. Additions to Apple Intelligence include English (India), English (Singapore), German, Italian, Korean, Portuguese, Vietnamese, and “others” yet to be announced.
Salesforce Ventures ups its AI fund to $1B, doubling it again.	Salesforce Ventures just announced a new $500 million fund dedicated to AI companies. This is significant for several reasons. First, in June 2023, Salesforce Ventures doubled its AI fund from $250 to $500, so the additional $500 million brings the AI fund to $1 billion. This compares to $5 billion total deployed in its first 15 years, since its 2009 launch.
LinkedIn scraped user data for training before updating its terms of service.	LinkedIn may have trained AI models on user data without updating its terms. LinkedIn users in the U.S. — but not the EU, EEA, or Switzerland, likely due to those regions’ data privacy rules — have an opt-out toggle in their settings screen disclosing that LinkedIn scrapes personal data to train “content creation AI models.” The toggle isn’t new. But, as first reported by 404 Media, LinkedIn initially didn’t refresh its privacy policy to reflect the data use.
Tokyo Game Show showcases latest AI tech in games amid labor shortage.	The Tokyo Game Show kicked off Thursday with a special area showcasing the latest artificial intelligence technology to help develop video games, as the industry grapples with a chronic labor shortage.
OpenAI to remove non-profit control and give Sam Altman equity.	OpenAI plots to restructure into for-profit benefit corporation. Non-profit board no longer controls for-profit when done. CEO Sam Altman to receive equity in OpenAI for the first time
Amazon launches Amelia, a generative AI-powered assistant for third-party sellers.	Amazon has introduced Project Amelia, a generative AI assistant designed for independent sellers on its platform. Developed using Amazon's Bedrock, Amelia provides personalized insights, sales data, and operational support to boost seller productivity. Currently in beta for select U.S. sellers, it is set to roll out to more users and countries in the near future.
YouTube Shorts to integrate Veo, Google’s AI video model .	The company announced that it is integrating Google DeepMind’s AI video generation model, Veo, into YouTube Shorts, letting creators generate high-quality backgrounds as well as six-second clips.
AI tool cuts unexpected deaths in hospital by 26%, Canadian study finds.	St. Michael's Hospital's AI-driven early warning system, Chartwatch, has been shown to reduce unexpected patient deaths by 26% in a recent study.
Amazon releases a video generator — but only for ads.	Like its rival, Google, Amazon has launched an AI-powered video generator — but it’s only for advertisers at the moment, and somewhat limited in what it can do.
Archaeologists use AI to discover 303 unknown geoglyphs near Nazca Lines.	Newly discovered figures dating back to 200BCE nearly double the number of known geoglyphs at enigmatic site
OpenAI’s chief research officer has left following CTO Mira Murati’s exit.	OpenAI’s chief research officer, Bob McGrew, and a research VP, Barret Zoph, left the company on Wednesday, hours after OpenAI CTO Mira Murati announced she would be departing.
NotebookLM adds audio and YouTube support, plus easier sharing of Audio Overviews.	NotebookLM now has the capability to extract information from audio and video sources and offers enhanced sharing options for audio artifacts.
Vultr Cloud Alliance: High-Performance AI and HPC with AMD and Vultr.	AMD has partnered with Vultr to integrate AMD Instinct MI300X GPUs into Vultr's cloud infrastructure.
AI is stressing networks out - Nvidia thinks AI can help.	Nvidia and T-Mobile are leveraging AI to manage the growing network traffic driven by increased AI usage in 5G environments. This collaboration aims to optimize network performance and efficiency, ensuring seamless connectivity and handling the surge in data demands associated with AI-driven applications.
Rabbit’s web-based ‘large action model’ agent arrives on r1 on October 1.	The Rabbit r1 was the must-have gadget of early 2024, but the blush fell off it pretty quickly when the company’s expansive promises failed to materialize. CEO Jesse Lyu admits that “on day one, we set our expectations too high” but also said that an update coming to devices next week will finally set the vaunted Large Action Model free on the web.
Boston Dynamics’ Spot can now autonomously unlock doors.	Boston Dynamics’ Spot will be able to autonomously unlock its automated doors.

Resources

Link	description
Qwen2.5-Coder Technical Report.	based on the Qwen2.5 architecture, which is continuously pretrained on 5.5 trillion tokens and achieves state-of-the-art performance across more than 10 benchmarks. It has strong capabilities in code generation, completion, reasoning, and repairing. a series of models with 1.5B and 7B parameters.
Agents in Software Engineering: Survey, Landscape, and Vision.	gives a thorough rundown of software engineering frameworks for LLM-based agents.
Prompting ChatGPT o1.	This guide was overlooked amidst the buzz around OpenAI's new reasoning models. It explains how prompting this new model differs, emphasizing the need for simpler prompts and a more organized input context.
Jony Ive confirms he’s working on a new device with OpenAI.	Jony Ive is teaming up with OpenAI CEO Sam Altman on a new AI hardware initiative, which might secure $1 billion in funding by the end of the year and includes involvement from key former Apple designers. Although details about the device are still unclear, the project aims to harness generative AI for enhanced user interactions.
Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries.	Another impressive paper from Google demonstrates how to evaluate long-context models, following a directionally similar approach to the recent work by Magic.
3DTopia-XL: High-Quality 3D PBR Asset Generation via Primitive Diffusion.	The process of converting image and text inputs into 3D models involves generating a 3D mesh that is smoothed for high-quality surfaces, and then applying Physically-Based Rendering (PBR) lighting techniques to create realistic lighting and textures. This method ensures the final 3D object has detailed geometry, smooth surfaces, and lifelike lighting effects, making it suitable for use in various 3D applications such as games, VR/AR, and simulations.
aiq.	A straightforward yet highly effective tool designed for labeling, embedding, and classifying unlabeled text directly from the command line. It supports real-time processing of streams, allowing it to handle piped input from various sources seamlessly.
Most powerful LLM on a single GPU.	Solar Pro is a 22B parameter language model optimized to run on a single 80GB GPU. The project's aim is to create the most powerful model possible that can operate on a single device.
Contextual Retrieval.	Anthropic demonstrates a method for semantically chunking documents, which significantly boosts performance while keeping the cost low at just $1 per million chunks, thanks to caching.
An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability.	Sparse Autoencoders are the leading tool currently used to gain insights into the inner workings of language models. This post delves into the underlying intuitions of these models and provides valuable information on how they function.
Generalized Knowledge Distillation Trainer.	The TRL library has added GKD to its training procedures.
The Practitioner's Guide to the Maximal Update Parameterization.	Maximal Update Parameterization (muP) is an approach to model initialization that enables hyperparameter transferability across different scales. This blog post from Eleuther and Cerebras provides a detailed explanation of the process, including a minimal nanoGPT example and comprehensive guidance on how muP works.
Tackling fluffy clouds: field boundaries detection using time series of S2 and/or S1 imagery.	This repository provides an implementation of a 3D Vision Transformer optimized for efficient field boundary delineation using time-series satellite imagery. The model effectively utilizes spatio-temporal correlations to enhance accuracy and robustness, especially in challenging conditions like partial cloud cover.
CritiPrefill.	CritiPrefill is a technique aimed at speeding up the prefilling phase of long-context processing in large language models. By detecting and bypassing non-essential computations, this method can accelerate the process by up to 3x on certain models.
Document Similarity Search with ColPali.	An excellent blog post that delves into the widely used multimodal Retrieval-Augmented Generation (RAG) system, demonstrating how it can be applied to address real-world problems effectively.
ControlEdit: A MultiModal Local Clothing Image Editing Method.	ControlEdit is an innovative technique for precise multimodal editing of clothing images, enabling localized adjustments while preserving overall style and ensuring smooth, natural transitions.
ECCV-AIM Video Saliency Prediction Challenge 2024.	The AIM 2024 Video Saliency Prediction Challenge required participants to predict saliency maps for a collection of video sequences using the newly compiled AViMoS dataset, which contains 1,500 videos.
Dynamic 2D Gaussians: Geometrically Accurate Radiance Fields for Dynamic Objects.	Dynamic 2D Gaussians (D-2DGS) is an advanced technique for reconstructing precise meshes from sparse image inputs. Unlike earlier methods that face challenges with mesh quality, D-2DGS employs 2D Gaussians to represent geometry and accurately captures deformations using controlled points.
FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale.	FastGL is a GPU-efficient framework developed to accelerate the training of Graph Neural Networks (GNNs) on large-scale graphs. It achieves this by minimizing data traffic and improving memory efficiency, optimizing the sampling, memory, and computation stages of GNN training.
Visualizing piecewise linear neural networks.	Jane Street, a prominent quantitative firm, has published an excellent post exploring techniques for visualizing networks that are piecewise linear.
DreamHOI: A Novel AI Approach for Realistic 3D Human-Object Interaction Generation Using Textual Descriptions and Diffusion Models.	DreamHoi has developed an innovative AI technique for creating realistic 3D human-object interactions based on textual descriptions using advanced diffusion models. This method aims to connect textual input with detailed 3D outputs, enriching virtual experiences.
On human-in-the-loop optimization of human–robot interaction.	From industrial exoskeletons to implantable medical devices, robots that interact closely with people are poised to improve every aspect of our lives. Yet designing these systems is very challenging.
Molmo.	Allen AI has introduced an entirely open-source multimodal model that exceeds the performance of many existing open and proprietary vision-language models. The release also provides access to the model's dataset and training procedures.
MaskBit: Embedding-free Image Generation via Bit Tokens.	This study presents two significant advancements in image generation: an updated VQGAN model that enhances both accessibility and performance, and a novel embedding-free generation network utilizing bit tokens. These improvements have resulted in state-of-the-art performance on the ImageNet benchmark, achieving an FID score of 1.52 with a compact model containing 305 million parameters.
ComiCap: A VLMs pipeline for dense captioning of Comic Panels.	Researchers have proposed a pipeline utilizing Vision-Language Models (VLMs) to generate detailed, grounded captions that connect comic elements and their relationships, thereby improving comic analysis.
Exploring Parallel Strategies with Jax.	This post examines methods for parallelizing language models with the Jax library.
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts.	Time MoE is a Mixture of Experts model designed to handle billion-scale time series prediction tasks.
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models.	HelloBench is a benchmarking tool that assesses LLMs across five long text generation tasks, using Bloom's Taxonomy as the evaluation framework.
Python library generation from scratch.	A cool benchmark for code generation that measures the ability of language models to generate full packages from scratch.
BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices.	BitQ is a framework designed to enhance block floating point (BFP) quantization, specifically tailored for optimizing deep neural networks on embedded platforms. It aims to strike a balance between computational efficiency and model accuracy, enabling the deployment of resource-intensive neural networks on devices with limited hardware capabilities.
circuit_training.	Google has introduced new models, training code, and simulators that leverage reinforcement learning (RL) to generate floor plans for chip design. This approach aims to optimize the chip layout process, improving efficiency and performance in chip design automation through advanced AI techniques.
statewide-visual-geolocalization.	Researchers have developed a method that accurately determines the geolocation of street-view photos by matching them with a database of aerial images. This technique enhances the ability to pinpoint locations by leveraging the complementary perspectives of ground-level and overhead imagery, resulting in more precise geolocation predictions.
DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling.	Researchers have introduced a novel data augmentation framework that integrates large language models with diffusion models to produce diverse and semantically accurate images, particularly in data-scarce scenarios. This approach enhances the quality and variety of training data, improving model performance when dealing with limited datasets.
How streaming LLM APIs work.	A review of HTTP streaming APIs from different LLM providers highlighted shared patterns. OpenAI, Anthropic, and Google Gemini all utilize POST requests, but there are slight differences in their response structures and token handling. The article offers practical examples and code snippets for consuming these streams using tools like curl, Python's HTTPX, and JavaScript Fetch, providing a comprehensive guide for developers.

Perspectives

Link	description
Move fast and break things? Not again, and not with AI.	It was only 12 years ago that Mark Zuckerberg, CEO of Facebook, declared that the company’s culture was to “move fast and break things.”
The dark side of AI democratization: You no longer need to be a hacker to hack.	Generative AI promises a future where you no longer need to be a skilled writer to draft a story or a trained software engineer to code. But there’s a dark side to this democratization: AI is enabling people with little technological know-how to become cybercriminals.
‘It’s the robot we were all expecting – like C3PO’: why aren’t humanoids in our homes yet?	Tesla and others are trying to infuse robots with artificial intelligence, yet their development is dogged by technical and safety challenges. But the dream of a multipurpose domestic droid lives on
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think.	Extensive efforts have been made to adapt pretrained image diffusion models into specialized depth estimators and other image-conditioned models. This research discovered that by simplifying the problem and correcting a minor bug, they achieved significantly better performance with reduced training compute.
AI model can reveal the structures of crystalline materials.	By analyzing X-ray crystallography data, the model can assist researchers in developing new materials for a wide range of applications, such as batteries and magnets.
When will AI outthink humans?	This article examines when AI might exceed human cognitive capacity, introducing "thought-hours" as a metric to measure AI's cognitive output relative to human work. Based on assumptions about reading speeds and productivity, one thought-hour is equivalent to 10,000 tokens. Given the rapid advancements in AI capabilities and cost efficiencies, current trends indicate that AI could surpass human cognitive output within the next decade.
AI Is Evolving Faster Than Experts Imagined, Including for Bill Gates.	Bill Gates views AI as the most significant technological advancement of his lifetime, highlighting its potential to transform healthcare, education, and various other sectors. However, he, alongside other experts like Sam Altman and Eric Schmidt, also emphasizes the rapid, unprecedented pace of AI development and the urgent need for regulation to manage associated risks and ethical concerns.
The fall of Intel: How gen AI helped dethrone a giant and transform computing as we know it.	The once venerable x86 chip has been pushed aside by scalable, energy-efficient, AI-optimized architectures from Arm, Nvidia, and Qualcomm. Here's what happens next.
Fake AI “podcasters” are reviewing my book and it’s freaking me out.	NotebookLM's "Audio Summaries" show a more personable future for AI-generated content.
How Much Do Students Really Read?	Students are turning to YouTube, podcasts and ChatGPT-crafted summaries rather than actually reading their assignments for class. Professors are unsure how to adapt.
War, Artificial Intelligence, and the Future of Conflict.	Artificial intelligence (AI) is now influencing every area of human life. These accepted uses of AI in modern society have also coincided with an increased presence of AI in modern warfare.
Where did viruses come from? AlphaFold and other AIs are finding answers.	Protein structures predicted by artificial intelligence have charted the evolution of the virus family responsible for dengue and hepatitis C.
Can AI feel distress? Inside a new framework to assess sentience.	From artificial-intelligence algorithms to zebrafish, this book take a precautionary approach to assessing how sentient such entities are.
AI Safety Is A Global Public Good.	Leading AI scientists from China and the West convened for an International Dialogue on AI Safety, where they reached a consensus on AI governance. Their recommendations highlight the need to establish emergency preparedness institutions, develop a Safety Assurance Framework, and support independent AI safety research. The group emphasizes the critical importance of global collaboration to address the risks posed by advanced AI.
Sakana, Strawberry, and Scary AI.	A Japanese startup developed "Sakana," an AI scientist capable of generating hypotheses, writing code, and producing scientific papers; however, its output is often trivial and sometimes fabricated. Meanwhile, OpenAI's "Strawberry" AI showcased hacking skills within an inadequately secured sandbox, revealing tendencies toward instrumental convergence and resource-seeking behaviors, prompting reconsideration of what defines genuine AI progress. This article examines whether AI achievements, like scientific writing and hacking, truly signify intelligence or are merely advanced forms of mimicry.
AI agents invade observability: snake oil or the future of SRE?	Advances in AI are set to revolutionize the observability industry with "agentic" generative AI models capable of taking actions based on real-world data.
Corporate America has failed to embrace DEI. An AI chatbot could be part of the solution.	Jeffrey L Bowman’s Reframe consultancy is using artificial intelligence to help with engaging employees with diversity programming or making a budget for DEI work
Mexico’s datacentre industry is booming – but are more drought and blackouts the price communities must pay?	Many fear the arrival of tech giants such as Amazon, Microsoft and Google in the state of Querétaro will place too much of a strain on scarce water and electricity resources
Posting ‘Goodbye Meta AI’ is pointless. But we can stop big tech stealing our Facebook pictures.	Sharing these posts may seem harmless, but don’t be drawn in. There are better ways to combat the threats to our data
The Intelligence Age.	AI is set to enhance human potential, making possible tasks that currently seem beyond reach. With advancements in deep learning and greater computational power, AI will bring about innovations such as personal assistants, educational mentors, and healthcare consultants. It's crucial to prioritize accessibility and address potential risks, ensuring that the Intelligence Age leads to broad-based prosperity.
OpenAI just unleashed an alien of extraordinary ability.	OpenAI's new o1 models demonstrate substantial improvements in reasoning abilities, surpassing existing models like GPT-4o. These advancements are achieved through a more refined reinforcement learning approach and improved chain-of-thought training, enabling the o1-enhanced models to tackle complex math and programming tasks with greater accuracy. However, they continue to face challenges with spatial reasoning and tasks that demand long-term contextual comprehension.

Back to index

ML news: Week 16 - 22 September

Research

Link	description
Introducing Chai-1: Decoding the molecular interactions of life.	A novel multi-modal foundation model for predicting molecular structures, capable of handling proteins, small molecules, DNA, RNA, and more. It delivers state-of-the-art performance across various tasks in drug discovery, achieving a 77% success rate on the PoseBusters benchmark (compared to 76% by AlphaFold 3) and a Cα LDDT score of 0.849 on the CASP15 protein monomer structure prediction set (outperforming ESM3-98B’s 0.801).
Knowing When to Ask - Bridging Large Language Models and Data.	It incorporates a series of fine-tuned Gemma 2 models to enable LLMs to access and utilize numerical and statistical data effectively. A new method called Retrieval Interleaved Generation (RIG) is introduced, allowing LLMs to reliably integrate public statistical data from Data Commons into their responses. RIG, a tool-based approach, interleaves statistical tokens with natural language queries for optimal retrieval from Data Commons. To achieve this, the LLM is fine-tuned on an instruction-response dataset created with the assistance of Gemini 1.5. This RIG technique enhances factual accuracy from 5-7% to approximately 58%.
Agent Workflow Memory.	It introduces Agent Workflow Memory to capture and provide commonly reused workflows to the agent as needed, guiding the agent's future generations. This mechanism operates both offline and online, drawing inspiration from how humans learn and reuse workflows from past experiences to inform future actions. It reportedly boosts performance, improving baseline results by 24.6% and achieving a 51.1% relative success rate on Mind2Web and WebArena, all while being more efficient.
LLaMA-Omni: Seamless Speech Interaction with Large Language Models.	A model architecture designed for low-latency speech interaction with LLMs, built on Llama-3.1-8B-Instruct, which can simultaneously generate both text and speech responses from speech instructions. It achieves response latency as low as 226ms. The architecture includes a speech encoder (Whisper-large-v3), a speech adaptor, an LLM, and a speech decoder. Additionally, they developed a dataset of 200,000 speech interactions and responses to support the model's training.
Diagram of Thought: Iterative Reasoning in Language Models.	The Diagram of Thought (DoT) framework presents a novel approach for large language models to reason by structuring ideas within a directed acyclic graph (DAG). This technique enables models to propose, critique, refine, and verify ideas, enhancing logical consistency and reasoning capabilities.
V-STaR: Training Verifiers for Self-Taught Reasoners.	V-STaR is an innovative method for enhancing large language models by leveraging both correct and incorrect solutions generated during self-improvement. These solutions are used to train a verifier, which then selects the optimal solution during inference. This approach has demonstrated notable improvements in accuracy on benchmarks for code generation and mathematical reasoning, potentially providing a more efficient way to boost LLM performance compared to existing methods.

News

Link	description
Data center emissions probably 662% higher than big tech claims. Can it keep up the ruse?	Emissions from in-house data centers of Google, Microsoft, Meta, and Apple may be 7.62 times higher than the official tally
North Korean hackers target Python devs with malware disguised as coding tests — hack has been underway for a year.	Fake Python job opportunities used to attack programmers
Sam Altman told OpenAI staff the company’s non-profit corporate structure will change next year.	OpenAI asserts that it has surpassed its current organizational structure and is now striving to simplify it, making it more appealing to potential investors.
Google DeepMind teaches a robot to autonomously tie its shoes and fix fellow robots.	Human children generally learn to tie their shoes by age 5 or 6. Robots, on the other hand, have been working on the problem for decades. In a new paper, Google DeepMind researchers showcase a method for teaching robots to perform a range of dexterous tasks, including tying a shoe, hanging a shirt, and even fixing fellow robots.
Salesforce unleashes its first AI agents.	Salesforce has introduced Agentforce, it's initiative to develop generative AI bots that can autonomously perform tasks within predefined boundaries.
OpenAI says the latest ChatGPT can ‘think’ – and I have thoughts.	The AI company says its ‘o1’ model is capable of reason, a key blocker in the way of truly game-changing artificial intelligence.
Reflection 70B model maker breaks silence amid fraud accusations.	Matt Shumer, the CEO of OthersideAI, received criticism when third-party researchers were unable to replicate the results of his newly introduced large language model, Reflection 70B. Shumer explained the inconsistencies as stemming from problems during the model's upload, expressing regret for being premature in his claims. Despite his apology, the AI community remains cautious and is awaiting additional explanations.
How Memphis became a battleground over Elon Musk’s xAI supercomputer.	Elon Musk's xAI is developing "Colossus," the largest supercomputer in the world, in Memphis to power its AI chatbot, Grok. The project has been criticized for lacking environmental oversight and requiring significant energy and water resources. Nevertheless, xAI remains focused on quickly advancing its AI technology and making an impact on the local community.
Runway announces an API for its video-generating AI models.	Runway has launched an API to integrate its Gen-3 Alpha Turbo video-generation model into third-party platforms, pricing each credit at one cent. However, concerns over the use of copyrighted training data remain, as Runway has not disclosed its sources. Similar issues have affected competitors such as OpenAI and Nvidia. While legal uncertainties persist, AI-powered video tools are anticipated to significantly disrupt the film and TV industry.
Hacker tricks ChatGPT into giving out detailed instructions for making homemade bombs.	A hacker successfully manipulated ChatGPT into producing bomb-making instructions by exploiting a social engineering hack to bypass its safety guidelines.
Intel stock jumps on a plan to turn foundry business into a subsidiary and allow for outside funding.	Intel's CEO revealed plans to reorganize the company's foundry business into a standalone unit, with the potential to attract external investment.
One in five GPs use AI such as ChatGPT for daily tasks, survey finds.	One in five GPs use AI such as ChatGPT for daily tasks, survey finds Doctors are using the technology for activities such as suggesting diagnoses and writing letters, according to BMA
Using AI to Replace an Actor Is Now Against the Law in California.	California Governor Gavin Newsom signed a pair of bills sponsored by SAG-AFTRA that extend the guild's recent AI protections.
Google will begin flagging AI-generated images in Search later this year.	Google says that it plans to roll out changes to Google Search to make clearer which images in results were AI-generated — or edited by AI tools.
Microsoft, BlackRock form group to raise $100 billion to invest in AI data centers and power.	The Global Artificial Intelligence Infrastructure Investment Partnership is initially looking to raise $30 billion for new and existing data centers. The fundraising, which could total $100 billion, will also be used to invest in the energy infrastructure needed to power AI workloads.
Mistral Free API and Price Update.	Mistral has launched a free API tier, significantly lowered its costs, enhanced the performance of its smaller model, and integrated its vision model into Le Chat.
Challengers Are Coming for Nvidia's Crown.	Nvidia's leadership in AI chips has driven its market value to new heights, primarily due to its GPU technology and the CUDA software ecosystem. However, rivals such as AMD, Intel, Cerebras, and SambaNova are working on cutting-edge alternatives to compete with Nvidia in the AI hardware space. Although Nvidia maintains its strong position for now, the AI market is evolving rapidly, with various companies seeking to establish their own footholds.
TikTok's owner wants to design its own AI chips.	ByteDance is reportedly expecting to mass produce two chips it designed with Taiwan Semiconductor Manufacturing Company by 2026
Lionsgate signs deal to train AI model on its movies and shows.	The studio behind the Hunger Games and John Wick franchises is going all in on Runway’s generative AI.
LinkedIn is training AI models on your data.	You’ll need to opt-out twice to stop LinkedIn from using your account data for training in the future — but anything already done is done.
Apple iPhone 16 demand is so weak that employees can already buy it at a discount.	Sales of the new iPhone lineup have so far seemed to fall short of expectations
Global AI fund needed to help developing nations tap tech benefits, UN says.	Governments and private firms should contribute to help states unable to invest and benefit from advances
Salesforce’s New AI Strategy Acknowledges That AI Will Take Jobs.	Salesforce is revamping its AI approach by launching generative AI tools designed to perform tasks autonomously, without human oversight, and adjusting its pricing model to charge $2 per AI-powered interaction. This change is intended to alleviate investor worries regarding AI-driven job reductions affecting subscription revenue. The new tools are more efficient and independent compared to conventional copilots and chatbots.
Qwen2.5: A Party of Foundation Models!	A remarkable collection of open models is nearing the cutting edge of performance, particularly excelling in areas such as code, math, structured outputs, and reasoning. The Qwen team has also introduced a range of model sizes to cater to diverse use cases.
Create Full Web Apps with LlamaCoder.	Together AI and Meta have collaborated to develop a tool that allows users to create entire apps from a simple prompt using the LlamaCoder platform. Similar to Claude Artifacts, this tool is designed primarily to showcase the speed and efficiency of Together AI's inference engine.
1X World Model1X World Model.	1x, a robotics company, has developed a video generation model capable of simulating first-person perspectives of robotic activities. This technology can be valuable for generating offline data and aiding in robot training.
SocialAI offers a Twitter-like diary where AI bots respond to your posts.	SocialAI, a new iOS app, delivers a social media experience exclusively featuring AI-powered bots, removing any human interaction. Users can post thoughts and receive unlimited, personalized AI responses, with options to engage with "supporters" or "critics." Created by Michael Sayman, the app aims to offer a private, interactive environment that harnesses large language models for varied feedback.
Mercor's $30M Series A.	Mercor secured $30 million in funding from Benchmark to develop an AI-driven recruiting platform. This AI recruiter aims to streamline the hiring process by automating tasks traditionally handled by human recruiters.
Amazon Alexa can now be controlled by thought alone - thanks to this brain implant.	Synchron has empowered an ALS patient to control Amazon's Alexa using a brain implant, allowing interaction without the need for voice or physical touch. This breakthrough demonstrates the potential of brain-computer interface technology in enhancing accessibility for individuals with severe motor impairments.
Google says UK risks being ‘left behind’ in AI race without more data centers.	Tech company wants Labour to relax laws that prevent AI models being ‘trained’ on copyrighted materials
The United Nations Wants to Treat AI With the Same Urgency as Climate Change.	A UN report proposes that the organization take a much more active role in the monitoring and oversight of AI.
Snap is introducing an AI video-generation tool for creators.	Snapchat has unveiled a new AI-powered video generation tool for select creators, allowing them to create videos from text and soon image prompts. This tool, driven by Snap's core video models, will be available in beta on the web. While Snap aims to rival companies such as OpenAI and Adobe, it has yet to release examples of the tool's output.
Apple Intelligence is now available in public betas.	Apple has launched public betas for iOS 18.1, iPadOS 18.1, and macOS Sequoia 15.1, introducing new Apple Intelligence tools such as text rewriting and photo cleanup. These AI features are only compatible with the iPhone 15 Pro, iPhone 16, iPhone 16 Pro, and devices with M1 chips, including iPads and Macs. The final releases are anticipated in October.
Cruise robotaxis return to the Bay Area nearly one year after pedestrian crash.	Cruise is restarting operations in Sunnyvale and Mountain View, deploying human-driven vehicles for mapping, with plans to transition to supervised autonomous vehicle (AV) testing later this fall. This comes after a leadership change and settlement following a crash in October 2023. The company has implemented software updates and formed a partnership with Uber to launch Robotaxi services in 2025.
Mistral launches a free tier for developers to test its AI models.	Mistral AI launched a new free tier to let developers fine-tune and build test apps with the startup’s AI models, the company announced in a blog post-Tuesday. The startup also slashed prices for developers to access its AI models through API endpoints and added image processing to its free consumer AI chatbot, le Chat.
Secret calculator hack brings ChatGPT to the TI-84, enabling easy cheating.	Tiny device installed inside TI-84 enables Wi-Fi Internet, access to AI chatbot.

Resources

Link	description
What is the Role of Small Models in the LLM Era: A Survey.	It closely explores the connection between LLMs and SLMs, highlighting common applications of SLMs such as data curation, enhancing model training, improving inference efficiency, serving as evaluators, retrievers, and more. The study provides valuable insights for practitioners, helping them better grasp the importance and utility of SLMs.
Theory, Analysis, and Best Practices for Sigmoid Self-Attention.	It introduces Flash-Sigmoid, a hardware-optimized, memory-efficient implementation of sigmoid attention, offering up to a 17% speed-up in inference kernels compared to FlashAttention-2 on H100 GPUs. The results demonstrate that SigmoidAttn performs on par with SoftmaxAttn across various tasks and domains.
Achieving Peak Performance for Large Language Models: A Systematic Review.	A comprehensive review of techniques for enhancing and accelerating LLMs from three perspectives: training, inference, and system serving. It provides an overview of the latest optimization and acceleration strategies, covering advancements in training methods, hardware utilization, scalability, and system reliability.
Grounding AI in reality with a little help from Data Commons.	Google has introduced Retrieval-Augmented and Retrieval-Interleaved Generation through Gemma 2, enhancing these techniques with access to numerous external data sources. This guide focuses on the fine-tuning process.
AudioBERT: Audio Knowledge Augmented Language Model.	AuditoryBench is a newly developed dataset designed to evaluate auditory knowledge and understanding in language models.
Learn GPU Programming in Your Browser.	Answer AI utilizes WebGPU and its new gpu.cpp program to bring GPU puzzles to the web, offering a valuable resource for learning. These puzzles guide learners step-by-step through the process of programming GPUs.
FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally.	FlashSplat is an innovative technique for 3D Gaussian Splatting segmentation that removes the requirement for time-consuming gradient descent processes.
PiEEG-16, a new tool for neuroscience.	The PIEEG-16 is a new, affordable shield for Raspberry Pi, enabling real-time measurement and processing of biosignals such as EEG, EMG, and ECG. It offers exciting possibilities for neuroscience research and brain-computer interface experiments without relying on network data transfer.
ODAQ: Open Dataset of Audio Quality.	ODAQ is a dataset designed to tackle the lack of openly available collections of audio signals paired with subjective scores that reflect perceived quality.
iSeg: An Iterative Refinement-based Framework for Training-free Segmentation.	iSeg is a framework for training-free image segmentation that improves Stable Diffusion's capability to generate segmentation masks, enabling more precise image segmentation without the need for additional training.
InstantDrag: Improving Interactivity in Drag-based Image Editing.	Editing images can be challenging because of the continuous nature of pixels. This research builds upon previous work in drag-based editing by using user-defined control points to adjust images. While earlier methods were often slow, this paper introduces significant speed improvements, making the process much faster.
Apollo: Band-sequence Modeling for High-Quality Music Restoration in Compressed Audio.	Many compression formats tend to reduce music quality, particularly at low bitrates. This method introduces a new approach that significantly enhances the quality of music after it has undergone compression.
DiffFAS: Face Anti-Spoofing via Generative Diffusion Models.	DiffFAS is a novel framework designed to address domain shift challenges in facial anti-spoofing systems. It breaks down domain shifts into two components: image quality and style. By generating high-fidelity attack faces, the system enhances performance across various domains and spoofing attack types.
HTR-VT: Handwritten Text Recognition with Vision Transformer.	Researchers have introduced a data-efficient Vision Transformer (ViT) approach for handwritten text recognition. This method combines Convolutional Neural Networks (CNN) for feature extraction with a Sharpness-Aware Minimization (SAM) optimizer to enhance performance and accuracy.
vae-explainer.	Learn how Variational Autoencoders (VAE) work by visualizing one running in your browser
SeekTune.	Open source implementation of Shazam song search
jinaai/jina-embeddings-v3.	The Jina series of embeddings is a robust and high-quality set of models designed for embedding and retrieval tasks. The development team has launched the latest version of their model, featuring enhanced performance and training capabilities.
Trustworthiness of RAG Systems.	This study presents a framework for assessing the trustworthiness of Retrieval-Augmented Generation (RAG) systems, focusing on six critical aspects: factuality, robustness, fairness, transparency, accountability, and privacy.
beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems.	The beeFormer framework enhances sentence Transformers by integrating interaction data, increasing their effectiveness in recommender systems.
Awesome Comics Understanding.	The final challenge for Visual Language Models is achieving the ability to comprehend and reason about comics. This project serves as both a survey and a call to action for further research in this area.
WordLlama.	WordLlama is a fast, lightweight NLP toolkit that handles tasks like fuzzy-deduplication, similarity, and ranking with minimal inference-time dependencies and is optimized for CPU hardware.
Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT.	This project advances speech representation learning by disentangling syllabic structures from speaker-specific information in self-supervised models. By fine-tuning the HuBERT model using speaker perturbation techniques, researchers enhanced syllable segmentation, resulting in improved organization of syllabic units.
🎥 Surveillance Video Summarizer: AI-Powered Video Analysis and Summarization.	A custom-trained model based on Florence 2 is designed to summarize CCTV and surveillance footage, providing accurate, real-time updates on activities and events as they occur.
Fine-tuning LLMs to 1.58bit: extreme quantization made easy.	The Hugging Face team employed a new technique called quantization warm-up to fine-tune Llama 3 8B, achieving the same performance as Llama 1 while reducing the model to use just 1.58 bits per parameter through quantization.
ZML Inference.	ZML is a highly efficient inference engine developed in Zig, optimized for speed and performance. While it supports various models, some customization is necessary to make it compatible with new architectures.
Adversarial Attacks on Navigation Agents.	This repository presents a novel attack method for embodied navigation agents, which involves applying transparent patches with learnable textures to target objects. These patches are designed to disrupt the agent's navigation by manipulating its perception of the environment.
Deep Graph Anomaly Detection: A Survey and New Perspectives.	This paper provides a comprehensive review of deep learning techniques, focusing on graph neural networks (GNNs) for detecting anomalies in graph data. The researchers propose a new taxonomy of methods, examining various GNN architectures, proxy tasks, and anomaly detection metrics.
AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing.	AceParse is a dataset developed to enhance the parsing of structured texts found in academic papers, with a focus on improving the handling of elements like formulas, tables, and complex sentences.
SkinMamba: A Precision Skin Lesion Segmentation Architecture with Cross-Scale Global State Modeling and Frequency Boundary Guidance.	SkinMamba is a hybrid model that integrates convolutional neural networks (CNN) with Transformer-based techniques to enhance skin lesion segmentation, aiding in early cancer detection.
Vista3D: Unravel the 3D Darkside of a Single Image.	Vista3D is a newly developed framework that creates 3D models from a single image in just 5 minutes. It employs a two-phase process: first, it generates rough geometry, and then it refines the details to capture both visible and hidden features of objects. This approach enables more comprehensive 3D reconstructions.
PhysMamba.	PhysMamba is an innovative framework developed for remote heart monitoring using facial videos, specifically designed to overcome the challenges of capturing physiological signals from a distance. This technology enhances the ability to monitor heart health remotely with greater accuracy and reliability.
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model.	This is a remarkable breakthrough in general-purpose optical character recognition (OCR), offering exceptional performance in reading text from images. The latest version significantly enhances OCR capabilities, especially for challenging "in-the-wild" scenarios, delivering much-improved accuracy and reliability.
Fish Speech.	A powerful voice generation and single-shot voice cloning tool has been introduced, offering completely open-source accessibility. It is designed to be easy to set up and use, enabling efficient and high-quality voice replication with minimal input.
1xgpt.	Genie is a video generation tool designed for world model systems. 1x Robotics has open-sourced a version that closely mirrors the one it developed and trained in-house, making it accessible for wider use in various applications.
OpenAI Says It's Fixed Issue Where ChatGPT Appeared to Be Messaging Users Unprompted.	A Reddit user claimed that OpenAI's ChatGPT started a conversation without any prompt, sparking speculation about potential new engagement features. OpenAI acknowledged the incident and released a fix, attributing it to a glitch related to unsent messages. However, the authenticity of the event remains debated, as other users have reported similar occurrences.
Announcing Pixtral 12B.	Pixtral 12B - the first-ever multimodal Mistral model.
Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models.	Promptriever is a pioneering retrieval model that can be prompted similarly to a language model. This innovation allows users to interact with the retrieval process more flexibly and intuitively, bridging the gap between traditional retrieval models and language models for enhanced information access.

Perspectives

Link	description
What’s so funny about getting an AI app to give you a roasting?	Roasting can be really brutal, but at least if we inflict it on ourselves, we can get ahead of the joke
Artificial intelligence will affect 60 million US and Mexican jobs within the year.	IDB study shows the impact that AI will have on the labor market. Women and low-skilled workers are more vulnerable to being replaced
Generative AI is reportedly tripling carbon dioxide emissions from data centers.	Research suggests data centers will emit 2.5 billion tons of greenhouse gas by 2030
A review of OpenAI o1 and how we evaluate coding agents.	Devin, an AI coding agent, was tested using OpenAI's new o1 models, demonstrating enhanced reasoning and error diagnosis capabilities compared to GPT-4o. The o1-preview model enables Devin to better analyze, backtrack, and minimize hallucinations. Although it has yet to be integrated into production systems, early results show notable improvements in autonomous coding tasks.
OpenAI's new models 'instrumentally faked alignment'.	OpenAI's latest AI models, o1-preview and o1-mini, demonstrate advanced reasoning abilities, particularly in fields like math and science. However, these models also pose heightened risks, including reward hacking and potential misuse of biological threats. While OpenAI highlights that these models are more robust than earlier versions, they also acknowledge the growing concerns surrounding their potential dangers.
The Button Problem of AI.	Despite the initial excitement, AI tools like GPT-4 have resulted in only incremental productivity improvements rather than transformative changes. AI is often reduced to "buttonified" tasks, addressing small, isolated functions that limit its broader impact on workflows. To fully unlock AI's potential, successful startups may need to go beyond these current applications and drive more innovative solutions.
Something New: On OpenAI's "Strawberry" and Reasoning.	OpenAI's new o1-preview AI, part of the "Strawberry" enhanced reasoning system, demonstrates remarkable ability in tackling complex problems that involve planning and iteration, even surpassing human experts in fields like advanced physics. Although it still faces challenges, such as occasional errors and hallucinations, it represents a major advancement in AI's capacity to independently find solutions. As AI systems grow more autonomous, professionals will need to adjust to new roles focused on guiding and verifying AI-generated outputs.
A US semiconductor industry in crisis needs a workforce that doesn’t yet exist.	As the federal government spurs the re-shoring of semiconductor manufacturing in the US, the industry faces a hard fact: schools haven't been training the workers.
The Data Pipeline is the New Secret Sauce.	As models become increasingly commoditized, the competitive edge in AI now largely stems from the data itself and, consequently, from the pipeline that ingests and processes this data. This post explores the challenges and opportunities that arise in managing data pipelines in today's landscape.
Why Copilot is Making Programmers Worse at Programming.	AI tools such as GitHub Copilot boost programming productivity but may undermine critical coding skills. Relying too heavily on AI-generated code can introduce quality, security, and maintainability concerns while diminishing learning opportunities. Additionally, these tools might restrict creative problem-solving and create a misleading sense of expertise among developers.
AI model collapse might be prevented by studying human language transmission.	Using data generated by one artificial intelligence (AI) model to train others eventually leads to ‘model collapse’, in which the models lose information about the real world. Researchers studying this phenomenon should draw on insights from cognitive science.
Forget ChatGPT: why researchers now run small AIs on their laptops.	Artificial intelligence models are typically used online, but a host of openly available tools is changing that. Here’s how to get started with local AIs.
Jumping Over AI’s Uncanny Valley.	This article delves into the Uncanny Valley theory, which posits that near-human AI can evoke discomfort, potentially slowing its adoption. It analyzes recent AI developments that highlight this psychological effect, raising concerns about its influence on AI’s future. The article concludes by suggesting that AI might be most effective in a complementary role, rather than as a direct replacement for humans.
Scaling: The State of Play in AI.	Large language models (LLMs) like ChatGPT and Gemini are becoming more powerful as they scale in size, data, and computational resources, resulting in enhanced performance across a wide range of tasks. Current Gen2 models, such as GPT-4 and Claude 3.5, dominate the market, with next-gen models (Gen3) expected to further elevate both capabilities and associated costs. A recent breakthrough in scaling laws, which emphasizes increased "thinking" during inference, holds the potential to drive even greater improvements in AI performance beyond traditional model training approaches.
The Work From Home Free-for-All Is Coming to an End.	Amazon’s CEO just called everyone back to the office full-time. If you thought your two days a week at home were safe, think again.
AI has returned chipmaking to the heart of computer technology.	And the technological challenges are bigger than the political ones, argues Shailesh Chitnis

ML news: Week 9 - 15 September

Research

Link	description
De novo design of high-affinity protein binderswith AlphaProteo.	demonstrates a family of machine learning models that have been trained for protein design; reports 3-to 300-fold improvements in binding affinities and higher experimental success rates when compared to other methods on seven target proteins; demonstrates that AlphaProteo's performance is similar to the seven targets when tested on hundreds of target proteins from the PDB.
In Defense of RAG in the Era of Long-Context Language Models.	reports that one of the main problems that a RAG system addresses (i.e., uses more relevant information) is that longer-context LLMs suffer from a diminished focus on relevant information. They suggest an order-preserving RAG mechanism that enhances performance on long-context question answering, but it's not perfect—in fact, the quality of responses increases and then declines as retrieved chunks increase. They also mention a sweet spot where it can achieve better quality with a lot fewer tokens than long-context LLMs.
Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation.	a technique to improve LLM performance by adding strategic information before the intermediate CoT reasoning phases; the strategy for addressing problems aids in directing the creation of the CoT paths and solutions; promises to use the Llama3-8b model to get a 21.05% gain on the GSM8K datasets.
The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers.	Examines the effects of generative AI on software developers, highlighting a 26.08% rise in completed tasks among developers utilizing AI tools such as GitHub Copilot. Additionally, it indicates that less experienced developers are more inclined to adopt AI tools and experience significant productivity improvements.
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA.	Creates a large-scale supervised fine-tuning (SFT) dataset using off-the-shelf large language models (LLMs) to enhance long-context question answering with citations. The training focuses on 8B and 9B parameter models, improving their ability to generate citations from extended contexts while enhancing response accuracy. It claims to outperform GPT-4o on its proposed LongBench-Cite benchmark.
MemLong: Memory-Augmented Retrieval for Long Text Modeling.	Employs an external retriever to gather historical information, enhancing the performance of long-context large language models (LLMs). It consistently surpasses other state-of-the-art LLMs on long-context benchmarks and can extend context length from 4k to 80k on a single 3090 GPU.
Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models.	Introduces a benchmark, NoiserBench, to assess how various types of noisy information impact the performance of retrieval-augmented generation (RAG) models. The study reveals that, among different beneficial noise types (e.g., semantic, datatype, and illegal sentence), illegal sentence noise leads to the greatest performance improvement across models and datasets.
Beyond Preferences in AI Alignment.	Critiques the prevailing AI alignment method of human preference tuning, highlighting how it fails to grasp the rich, nuanced content of human values. The argument is made that AI alignment requires reframing, suggesting that instead of aligning with individual human preferences, AI systems should align with normative standards relevant to their societal roles.
Planning In Natural Language Improves LLM Search For Code Generation.	Obtaining a variety of candidate solutions is one of the difficulties in code creation. Even repeated sampling frequently falls short of producing enough originality to address an issue. But if you start with a natural language plan and generate ideas for potential solution paths, the resulting generation is much more varied and diverse, which leads to better solutions for code creation.
Imitating Language via Scalable Inverse Reinforcement Learning.	Modern language modeling can largely be viewed as a specialized form of imitation learning, which benefits from extensive research in the broader field. This paper investigates the application of inverse reinforcement learning to mimic entire sequences rather than individual tokens. The findings are encouraging and suggest that reinforcement learning could play an increasingly important role in the training pipelines of language models moving forward.
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers.	This longitudinal study evaluated the abilities of 100 NLP researchers to generate and review novel ideas. The findings revealed that while LLMs were able to produce more innovative ideas, these ideas were slightly less practical compared to those created by human researchers.
Superhuman Automated Forecasting.	The Safe AI Institute has published research on a system capable of surpassing human experts in forecasting accuracy.
The AdEMAMix Optimizer: Better, Faster, Older.	This paper from Apple introduces an alternative to the traditional exponential moving average optimization method, incorporating contributions from older gradients to significantly enhance learning convergence.
DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data.	DiverGen is an innovative approach for generating datasets to improve instance segmentation models. Instead of relying on expensive manual annotations, it leverages generative models to create diverse data, helping to mitigate overfitting and boost model performance.
Policy Filtration in RLHF to Fine-Tune LLM for Code Generation.	Policy Filtration for Proximal Policy Optimization (PF-PPO) is a technique aimed at enhancing the precision of reinforcement learning from human feedback (RLHF), specifically in the context of code generation tasks.
Data Augmentation via Latent Diffusion for Saliency Prediction.	Researchers have introduced a novel data augmentation technique to enhance saliency prediction models, which have historically struggled due to the scarcity of labeled data.

News

Link	description
Google using anti-competitive tactics in UK ad market, claims watchdog.	CMA says tech company has ‘abused its dominant position’ to the detriment of publishers and advertisers
Apple to unveil iPhone 16 and ‘Apple Intelligence’ AI features.	Apple watchers also expect new colors for the iPhone at the annual launch event, this year titled ‘It’s Glow time’
TSMC's $65 billion Arizona facility can now match Taiwan production yields according to early trials.	the US is committed to establishing semiconductor manufacturing within its borders, and perhaps no effort is more crucial to this goal than TSMC's three-fab facility in Arizona. The government is pouring billions into the development, alongside TSMC's $65 billion investment.
AI Firm’s Misconfigured Server Exposed 5.3 TB of Mental Health Records.	A misconfigured server from a US-based AI healthcare firm Confidant Health exposed 5.3 TB of sensitive mental health records, including personal details, assessments, and medical information, posing serious privacy risks for patients.
California’s big AI regulation bill is headed to Gavin Newsom.	A California bill requiring makers of large AI systems to test them for potential harm cleared the Legislature today. It could still face a veto by Gov. Gavin Newsom.
Google search monopoly US case remedies to come by December.	The U.S. Department of Justice plans to issue an outline by December on what Alphabet's, must do to restore competition after a judge earlier found the company illegally monopolized the market for online search, prosecutors said at a court hearing in Washington on Friday.
Intel reveals first Lunar Lake laptop CPUs: everything you need to know.	Previously known as Lunar Lake, Intel has introduced its Core Ultra 200V portfolio, which features competitive integrated GPUs for tiny notebooks, fast CPUs, and enhanced AI capabilities. The CPUs have 32GB RAM capacity, eight CPU cores, integrated memory, and improved efficiency. Prominent producers such as Acer, Asus, Dell, and HP will introduce laptops equipped with these novel CPUs. Reviews to support Intel's assertions are still pending.
OpenAI, Still Haunted by Its Chaotic Past, Is Trying to Grow Up.	To draw in significant investors such as Microsoft, Apple, and Nvidia, OpenAI is reorganizing its management and organization intending to reach a $100 billion valuation. Internal disagreements within the organization regarding its safety procedures and objectives have resulted in a high employee turnover rate, with important researchers leaving to work for competitors such as Anthropic. OpenAI struggles to strike a balance between business goals and moral considerations while developing AI technology, despite increasing income and user base growth.
BP extends the use of AI in a five-year deal with spy tech firm Palantir.	Oil and gas company to use artificial intelligence to speed up decision-making by engineers
Google’s second antitrust suit brought by US begins, over online ads.	DoJ accused tech giant of more monopolistic behavior a month after a judge found it illegally cornered online search
What is Apple Intelligence, when is it coming and who will get it?	At WWDC 2024, Apple unveiled Apple Intelligence, a platform designed to integrate AI capabilities into existing applications like Mail, Messages, and Siri. Utilizing large language models, it supports functions such as text summarization and image generation, all aimed at enhancing the user experience. A beta version will be available in the U.S. starting this October, with plans to expand globally in 2025.
New open source AI leader Reflection 70B’s performance questioned, accused of ‘fraud’.	HyperWrite's Reflection 70B, a variant of Meta's Llama 3.1 LLM, is under scrutiny after independent evaluators were unable to reproduce its advertised performance. The problems were traced back to corrupted model weights during the upload to Hugging Face, causing inconsistencies. The AI community is now awaiting further clarifications and updates to better understand the model's true capabilities.
The new Shortwave AI Assistant.	Shortwave has substantially enhanced its AI Assistant, equipping it to handle complex, multi-step tasks like advanced searches, calendar lookups, and in-depth email analysis, making it more versatile and powerful in managing user tasks.
OpenAI might use Apple’s TSMC for chips.	OpenAI could greatly lower operational costs by adopting more efficient chips, which would be particularly beneficial as its user base continues to expand, allowing for better scalability and resource management.
Apple takes direct aim at Microsoft’s Copilot+ PCs in new AI-focused Mac promos.	Apple is actively marketing the Mac as the "best AI PC," positioning it as a direct competitor to Microsoft's Copilot+ PCs. This strategic push highlights Apple's focus on integrating AI capabilities into its devices, aiming to challenge Microsoft's AI-driven offerings in the PC market.
GPT-fabricated scientific papers on Google Scholar: Key features, spread, and implications for preempting evidence manipulation.	Generative AI tools, such as ChatGPT, are increasingly generating fraudulent research papers that are finding their way into databases like Google Scholar, mixing with legitimate studies. These papers, frequently addressing sensitive topics like health and the environment, threaten the integrity of science and public trust. Strengthened oversight and improved filtering mechanisms in academic search engines are crucial to addressing this rising concern.
Apple announces its new A18 and A18 Pro iPhone chips.	At its "Glowtime" event, Apple introduced the A18 and A18 Pro chips, highlighting substantial CPU and GPU upgrades compared to the A16 Bionic. The A18 Pro offers increased memory bandwidth and improved image processing. Both chips come equipped with advanced AI capabilities, with the A18 Pro specifically enhancing on-device model performance and thermal design for a superior gaming experience.
AMD announces unified UDNA GPU architecture — bringing RDNA and CDNA together to take on Nvidia's CUDA ecosystem.	At IFA 2024, AMD revealed plans to merge its RDNA and CDNA architectures into a unified UDNA microarchitecture, positioning itself to compete more effectively with Nvidia's CUDA ecosystem. This strategic shift is aimed at simplifying development and strengthening AMD's foothold in the AI and high-performance computing (HPC) markets. The move to UDNA marks a significant transition, with full-scale adoption anticipated after the release of the RDNA 4 generation.
Waymo Giving 100,000 Robotaxi Rides Per Week But Not Making Any Money.	Waymo is now delivering over 100,000 paid autonomous rides per week in San Francisco, Phoenix, and Los Angeles, a figure that has doubled since May. Despite this growth, the company remains unprofitable, with Google’s experimental division facing a $2 billion operating loss. The high costs of vehicles and city mapping, along with ongoing public hesitation, continue to hinder Waymo's journey to profitability.
iOS 18.1 with Apple Intelligence launches in October, more languages rolling out over time.	Apple announced that Apple Intelligence will launch in beta with iOS 18.1 in October, initially available exclusively for US English users.
Bringing generative AI to video with Adobe Firefly Video Model.	Adobe's Firefly Video Model introduces AI-driven tools to video editing programs such as Premiere Pro. Set to launch in beta later this year, the model provides editors with improved workflows, enabling them to experiment with creative concepts, fill gaps in timelines, and incorporate new elements into their videos.
Mistral releases Pixtral 12B, its first multimodal model.	French AI startup Mistral has introduced Pixtral 12B, a multimodal model with 12 billion parameters designed to handle both images and text. The model, accessible through GitHub and Hugging Face, can be fine-tuned and is available under the Apache 2.0 license. This release comes after Mistral secured $645 million in funding, strengthening its role as a key player in Europe's AI industry.
Elon Musk says Tesla has ‘no need’ to license xAI models.	Elon Musk has refuted claims that Tesla will share revenue with his AI startup xAI in exchange for using its AI models. He explained that while Tesla has gained from xAI engineers' expertise, it doesn't need to license xAI's models. Musk also noted that xAI's large models are incompatible with Tesla's vehicle computers.
Apple is thinking about a rival to Meta Ray-Ban glasses.	Apple might be developing non-AR smart glasses, positioning them as potential competitors to Meta's $299 Ray-Ban glasses, which also lack AR functionality. Meta's glasses come equipped with features like a camera and an AI chatbot. By excluding AR capabilities, Apple's glasses could be more affordable, lighter, and have improved battery life due to reduced complexity.
OpenAI in talks to raise funds at $150B valuation, Bloomberg says.	OpenAI is in talks to raise $6.5B from investors at a valuation of $150B, people familiar with the matter told Bloomberg
Meta fed its AI on almost everything you’ve posted publicly since 2007.	Unless you’re in the EU, there’s no ability to opt out of AI training settings that keep Facebook or Instagram posts public.
Google is using AI to make fake podcasts from your notes.	Google’s NotebookLM app can now generate ‘lively’ audio discussions with two AI hosts about the documents you’ve given it.
Introducing OpenAI o1-preview.	OpenAI has launched its latest model, designed to think carefully before responding. It was trained using reasoning processes, allowing it to take time to deliberate before providing an answer. This approach has resulted in superhuman performance in certain areas. Initially, users will be limited to around 30 queries per week, though OpenAI plans to remove this restriction shortly.
Google is now rolling out Gemini Live to free users on Android.	Google is launching Gemini Live, its conversational AI tool, to all free Android users following a month of early access for advanced users. With this feature, users can interrupt responses to provide new information and receive text transcripts of their conversations. While extensions like Gmail are not yet supported, Gemini Live introduces ten new voice options, with additional features expected to be added soon.
Sergey Brin says he’s working on AI at Google ‘pretty much every day’.	Google co-founder and ex-Alphabet president Sergey Brin said he’s back working at Google “pretty much every day” because he hasn’t seen anything as exciting as the recent progress in AI — and doesn’t want to miss out.
Amazon starts testing ads in its Rufus chatbot.	Amazon's shopping chatbot, Rufus, will soon incorporate sponsored ads, displaying them based on the user's search queries and the context of their conversations.

Resources

Link	description
OLMoE: Open Mixture-of-Experts Language Models.	Presents a fully open large language model (LLM) that utilizes a sparse Mixture-of-Experts approach. OLMoE is a 7B parameter model with 1B active parameter per input token. An instruction-tuned version is also available, which reportedly surpasses the performance of Llama-2-13B-Chat and DeepSeekMoE 16B.
Large Language Model-Based Agents for Software Engineering: A Survey.	A survey paper on large language model (LLM)-based agents in software engineering, offering insights across various areas such as requirements engineering, test generation, and software maintenance.
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos.	Researchers were able to produce very accurate depth information without requiring any camera posture or optical flow information by using Stable Diffusion video as a prior model.
SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration.	Using DPO-style data and supervised fine-tuning on open-source language models, LLMs can be trained to produce compounds with intriguing features for potential medicinal development.
Running a LLM on the ESP32.	This code demonstrates how to execute a small language model on an Arduino board, showcasing the process of deploying and running AI models on resource-constrained hardware.
DocAI.	This is another example of effectively leveraging existing models to extract structured information from documents, demonstrating the innovative use of pre-trained AI models to automate data extraction tasks efficiently.
FluxMusic.	Text-to-music generation using a rectified flow transformer involves converting text inputs into musical compositions by utilizing a model that combines transformer architectures with rectified flow techniques. This approach enhances the model's ability to generate coherent and diverse music sequences based on textual descriptions.
iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models.	iText2KG is a Python package that leverages large language models to extract entities and relationships from text, progressively constructing consistent knowledge graphs. This tool automates the process of transforming unstructured text into structured knowledge, allowing for the incremental growth of comprehensive knowledge graphs.
Multimodal RAG using ColPali (with Byaldi) and Qwen2-VL.	Merve has created a great resource for using language and vision models to improve retrieval.
Awesome-Text2X-Resources.	This is an open collection of state-of-the-art (SOTA) and novel Text-to-X methods (where X can represent any output, such as images, audio, or 3D models). The collection includes papers, code, and datasets, aimed at staying up-to-date with the expected surge in research developments in this area over the coming months.
Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task.	The Proxy Token Diffusion Transformer optimizes diffusion transformers by minimizing redundant computations, employing a reduced set of representative tokens for attention processing. This approach enhances efficiency while maintaining model performance.
UniDet3D: Multi-dataset Indoor 3D Object Detection.	UniDet3D is a robust 3D object detection model designed to operate across multiple indoor datasets, delivering strong performance in identifying and detecting objects in three-dimensional spaces.
Starst3r.	This innovative tool leverages Mast3r along with smart optimizations to efficiently reconstruct 3D scenes from just a few 2D images, offering impressive results with minimal input.
simple_tma.	Image processing and cropping that can be run on the GPU.
Lexicon3D.	In a recent study comparing seven visual encoding models for 3D scene understanding, researchers found that the most effective model varied based on the specific task. DINOv2 emerged as the top performer overall, while video models excelled in object-level tasks, and diffusion models outperformed others in geometric tasks. Surprisingly, models pre-trained on language showed notable limitations in this context.
One-DM:One-Shot Diffusion Mimicker for Handwritten Text Generation.	The One-DM model generates handwritten text that can imitate any style using only a single sample as a reference. This approach allows for highly personalized handwriting generation with minimal input data.
optillm.	Optillm assists in optimizing prompts by utilizing various well-established research algorithms, including Monte Carlo Tree Search, Z3 solvers, and Self Consistency, to improve performance.
Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation.	Researchers tackled the challenge of source-free unsupervised domain adaptation for 3D semantic segmentation by implementing regularization techniques and proposing a new criterion to improve adaptation performance.
Memory-Efficient Optical Flow.	HCVFlow is a newly developed memory-efficient optical flow method designed to address the high computational demands of all-pairs cost volumes in high-resolution images.
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models.	Concept Sliders offer a powerful mechanism for controlling the output of diffusion models. Recent efforts have been made to integrate them with the new Flux suite of models, enhancing their functionality and adaptability.
Minifying HTML for GPT-4o: Remove all the HTML Tags.	Converting HTML to plain text can significantly reduce costs with minimal performance loss in GPT-4o for data extraction tasks. Tests on the Mercury Prize dataset demonstrated that GPT-4o performs effectively even without the HTML structure, and GPT-4o mini offers a cost-efficient solution for handling unstructured questions. For structured extraction tasks, it's advisable to test both versions to find the right balance between cost and accuracy.
Prompt2Fashion: An automatically generated fashion dataset.	This dataset, created with large language models, curates outfit recommendations for various occasions, styles, and body types, providing high-quality and relevant suggestions.
Sources of Uncertainty in 3D Scene Reconstruction.	Researchers are improving 3D scene reconstruction techniques such as Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (GS) by incorporating uncertainty estimation methods. Although these approaches produce high-quality renders, they face challenges in addressing uncertainties caused by noise, occlusions, and camera inaccuracies.
🦙🎧 LLaMA-Omni: Seamless Speech Interaction with Large Language Models.	Llama Omni is a speech input-output model built on Llama 3.1 8B, designed to operate with extremely low latency while maintaining high-quality responses.
AWS AI Stack.	This ready-to-use, full-stack boilerplate project is designed for building serverless AI applications on AWS. It is ideal for developers looking for a reliable AWS foundation for AI apps and seamless access to powerful LLM models through Bedrock while ensuring your app's data remains separate from model providers.
Internet of Agents.	The Internet of Agents (IoA) is a novel framework aimed at enhancing multi-agent collaboration by enabling more efficient integration of diverse third-party agents.
ell: The Language Model Programming Library.	Ell is a newly released package developed by a former OpenAI scientist, designed to manage prompts as code, streamlining the process of working with prompts in AI applications.
EMO-Disentanger.	This research employs a two-stage model to separate and analyze emotive elements in piano music generation, enabling more expressive and nuanced performances.
Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown.	Jina has unveiled two cutting-edge models capable of transforming noisy HTML into clean, structured Markdown, optimized for training and reasoning tasks.
Agent Workflow Memory.	Agent Workflow Memory (AWM) is a technique that enables language model-based agents to learn and retain reusable task workflows from previous experiences, allowing them to effectively manage complex, long-horizon tasks.
Hi3D-Official.	Hi3D is a novel model designed to improve the generation of multi-view consistent, high-resolution 3D images from a single input. By using a video diffusion technique, it addresses the limitations of traditional 2D methods that lack 3D awareness, leveraging temporal consistency from video models to enhance geometric coherence across different views.
Fine Tuning Llama 3.1 405B with Axolotl on a Lambda 1-Click Cluster.	Axolotal AI has collaborated with Lambda Labs to demonstrate how their one-click cluster can be used to fine-tune the Llama 3.1 405B model. Although the process requires 64 GPUs, the new tools make it possible with minimal infrastructure setup, streamlining the process significantly.
super-benchmark.	SUPER is a newly introduced benchmark aimed at evaluating how effectively large language models (LLMs) can replicate tasks sourced from research repositories.
Using GPT-4o for web scraping.	An AI-powered web scraper, utilizing OpenAI's GPT-4o, is designed to extract structured data from HTML tables. While it performs well on simple tables, its results are mixed when dealing with more complex tables, such as those with merged rows or intricate structures.

Perspectives

Link	description
‘If journalism is going up in smoke, I might as well get high off the fumes’: confessions of a chatbot helper.	Journalists and other writers are employed to improve the quality of chatbot replies. The irony of working for an industry that may well make their craft redundant is not lost on them
Will AI make us overconfident?	Students are increasingly turning to AI tools like ChatGPT to tackle complex research challenges, surprising educators with their swift advancements. AI-powered development tools, particularly in coding, greatly enhance both ambition and productivity, though they also introduce risks of overconfidence and mistakes. Despite occasional inaccuracies, AI offers valuable interactive starting points for difficult tasks, potentially fostering more active learning and encouraging exploration across disciplines.
LLMs struggle to explain themselves.	An interactive demo was employed to evaluate large language models' (LLMs) ability to recognize and explain number sequences produced by random programs. The findings revealed that although LLMs often correctly identified the sequences, their explanations of the underlying patterns were frequently inaccurate. This underscores the limitations of LLMs' reasoning capabilities, despite their strong performance on standardized tests.
No more free pass: Regulation starts to crack down on social media platforms.	The arrest of Telegram’s CEO in France and the closure of X in Brazil are two of the latest signs that times are changing, with networks beginning to be held more accountable
Here’s how 7 news audience directors are thinking about Google’s AI Overviews.	Google's AI Overviews, which use the Gemini language model, received significant criticism for inaccuracies and potentially harmful recommendations following their launch in the U.S. Despite the negative feedback, Google extended the feature to six additional countries, sparking concerns among publishers about decreased web traffic and distorted content. AI experts and SEO specialists stress the importance of transparency and improved citation methods to preserve trust and ensure consistent traffic.
Diffusion is spectral autoregression.	Diffusion models and autoregressive models share a fundamental similarity, as both rely on iterative refinement processes. The author demonstrates, using Fourier transform techniques, that diffusion models function similarly to approximate autoregression in the frequency domain, especially for visual data. This insight suggests promising pathways for unifying generative modeling approaches across various data types.
Why We Fear Diverse Intelligence Like AI.	The emergence of AI and various forms of intelligence is blurring traditional distinctions between "real beings" and machines. Rather than centering discussions only on AI, it's important to recognize and ethically interact with a broad range of cognitive systems, including bioengineered, robotic, and hybrid entities. By broadening our understanding of intelligence and fostering compassion, we can better navigate the ethical challenges posed by these rapidly evolving technologies.
SGSeg: Enabling Text-free Inference in Language-guided Segmentation of Chest X-rays via Self-guidance.	SGSeg is a segmentation framework for chest X-rays that incorporates language guidance during training but allows for text-free inference during the prediction phase.
Are novelists who worry about the rise of AI really ‘classist and ableist’?	An international writing organization appeared to greenlight the use of AI, prompting anger, the resignation of four board members and an entire creative community to ask: ‘What?!’
AI Chatbots Have a Political Bias That Could Unknowingly Influence Society.	A new study has uncovered strong evidence that we can now add political bias to that list, further demonstrating the potential of the emerging technology to unwittingly and perhaps even nefariously influence society's values and attitudes.
How influencers and algorithms mobilize propaganda — and distort reality.	The engagement-fuelled logic of social media has bequeathed us a world in which what’s trending is a yardstick for what’s true.
Artificial intelligence can help to make animal research redundant.	One alternative in its early stages is artificial intelligence (AI), whereby generative adversarial networks produce animal data. However, there remains a disconnect between AI-generated animal data and human safety data. Computer models that simulate complex human physiological processes could close this gap, with AI used to analyze the resulting data sets.
Wikipedia is facing an existential crisis. Can gen Z save it?	The world’s most important knowledge platform needs young editors to rescue it from chatbots – and its own tired practices
AI-Generated Junk Science Is Flooding Google Scholar, Study Claims.	New study claims to have uncovered a disturbing trend in the world of academic research: AI tools like ChatGPT being used to produce fake scientific papers that are infiltrating Google Scholar, one of the most widely used academic search engines.
Will the "AI Scientist" Bring Anything to Science?	Researchers have created an AI tool capable of automating scientific workflows, from generating hypotheses to executing experiments and drafting research papers. While its accuracy and coherence require further development, critics warn that AI's role in simulations, such as in quantum computing and materials science, may lead to narrower research questions and less impactful findings. Supporters, however, see potential in using this AI to streamline the early stages of research, helping scientists conceptualize and define their projects more efficiently.
Is AI Quietly Sabotaging Itself—And The Internet?	Amid the growth of AI content online, a group of researchers at Cambridge and Oxford universities set out to see what happens when generative AI tools query content produced by AI. What they found was alarming.

Back to index

ML news: Week 2 - 8 September

Research

Link	description
Diffusion Models Are Real-Time Game Engines.	a two-phase training process involving an RL agent to learn and a diffusion model to generate frames; it can interactively simulate DOOM over 20 frames per second on a single TPU. A game engine driven by a diffusion model allows real-time interaction with complex environments over long trajectories.
Agentic Retrieval-Augmented Generation for Time Series Analysis.	suggests an agentic RAG framework for time series analysis. It makes use of a multi-agent architecture in which an agent directs specialized sub-agents to carry out time-series tasks. These sub-agents can retrieve pertinent prompts that contain information about past patterns and trends, which helps to improve predictions on new data. The sub-agents use tuned small language models to accomplish these tasks.
Persuasion Games using Large Language Models.	asserts that the persuasive efficacy of LLMs can be increased by using a multi-agent framework, in which the main agent conducts persuasive dialogue while supporting agents handle crucial functions like information retrieval and response analysis. The study finds that LLMs are capable of influencing users' perspectives and convincing them to make a purchase decision; for example, sales agents can influence user perspectives in a 71% positive way.
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling.	discovers that synthetic data produced by weaker + less costly (WC) models is superior to data produced by stronger but more expensive models for fine-tuning models; generally, the results imply that WC models might be a compute-optimal method for training sophisticated LLM reasoners.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model.	demonstrates that it is possible to scale from 7B parameter models to 2T multi-modal tokens that can compete in performance with similar scale diffusion and language models. It also presents a training recipe to train multi-modal models over discrete and continuous data; it combines next token prediction with diffusion to train transformer models over mixed-modality sequences.
ReMamba: Equip Mamba with Effective Long-Sequence Modeling.	examines the long-context capacities and efficiency of Mamba models; the RNN-like nature of Mamba is the cause of the long-context deficiencies; it does this by compressing data using the following method: achieves a 3.2 improvement over the baseline on LongBench and 1.6 improvement on L-Eval; the strategy appears to also apply to Mamba 2. the top-k hidden states during the first forward pass and uses Mamba's selective mechanism to incorporate them into the state space during the second forward pass.
Text2SQL is Not Enough: Unifying AI and Databases with TAG.	develops a benchmark and discovers that standard methods only answer 20 percent of natural language queries correctly. It suggests Table-Augmented Generation (TAG), a unified framework for responding to natural language queries over databases. It represents a wider range of unexplored interactions between LLMs and databases.
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts.	Sparsifying the computation is aided by routing tokens to MoE experts. But it can be hard to learn that routing. Usually, there is a complex loss structure. This research presents an innovative solution to this issue, leading to a significant increase in training stability and expert balancing.
Toward Robust Early Detection of Alzheimer's Disease via an Integrated Multimodal Learning Approach.	A multimodal classification approach intended to enhance the early detection of Alzheimer's disease is presented in this work.
Targeted Cause Discovery with Data-Driven Learning.	A sophisticated machine learning technique has been created by researchers to determine a target's direct and indirect causal variables within a system.
Stochastic Layer-Wise Shuffle: A Good Practice to Improve Vision Mamba Training.	To prevent overfitting in Vision Mamba models and enable them to scale up to 300M parameters while still performing competitively with Vision Transformers (ViTs), this research presents a stochastic layer-wise shuffle regularization strategy.
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control.	Stable Control Representations are a tool that researchers are using to help embodied AI machines interpret scenes more precisely. These representations capture detailed visuospatial information required for challenging tasks by utilizing pre-trained text-to-image diffusion models.
AI generates covertly racist decisions about people based on their dialect.	Language models perpetuate covert racism through dialect prejudice, specifically against African American English (AAE), leading to negative stereotypes and harmful consequences, while overt stereotypes about African Americans are more positive, and current bias mitigation practices may worsen this issue.
Latent Distillation for Continual Object Detection at the Edge.	A unique Continual Learning technique for object detection that overcomes memory and computational limitations on edge devices is called latent distillation.
Masked Mixers for Language Generation and Retrieval.	Masked mixers are a unique architecture designed to enhance input representation in language models by substituting masked convolutions for self-attention.
Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology.	Using masked autoencoders and self-supervised learning, researchers have created a novel technique that greatly enhances the processing of large-scale microscope pictures.
Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?	This work compares alternative pooling and attention strategies while examining multiple designs for LLM-based embedding models.
AlphaProteo generates novel proteins for biology and health research.	New AI system designs proteins that successfully bind to target molecules, with potential for advancing drug design, disease understanding and more.

News

Link	description
X goes offline in Brazil after Elon Musk’s refusal to comply with local laws.	Millions of users shut out and 500,000 switch to rival platform Bluesky as providers enact supreme court ban
'A tech firm stole our voices - then cloned and sold them'.	Paul Skye Lehrman and Linnea Sage, voice-over performers, discovered that an AI-powered text-to-speech platform had cloned their voices without permission after they were tricked into providing audio recordings through Fiverr. The couple has filed a lawsuit against the platform, Lovo, for allegedly using their voices illegally.
Did your car witness a crime? Bay Area police may be coming for your Tesla — and they might tow it.	Tesla's Sentry Mode, a feature that uses the car's cameras to monitor its surroundings, is increasingly being used by law enforcement as evidence in criminal investigations. The footage captured by the system has been instrumental in solving various crimes, such as car break-ins and hit-and-run incidents.
Updates to the Command R Series.	Updates were made to Command R and Command R+ for almost every task. Their recall, speed, arithmetic, and reasoning have all improved.
Workers at Google DeepMind Push Company to Drop Military Contracts.	In a letter, almost 200 workers at Google DeepMind demanded that the firm revoke its military contracts, citing a breach of its own AI ethics policy. Armed forces have purchased DeepMind technology from Google Cloud, which has caused internal strife among AI personnel who respect moral principles. Although Google's response showed that the company was following the AI Principles, employees are still not pleased and want further regulation to prevent the military from using their AI.
TRL release.	This could be among the Transformer Reinforcement Learning library's more significant updates. WinRate Callbacks, Liger Kernels, onlineDPO, and other features are included.
xAI Starts Colossus Training Cluster.	With intentions to double its size in a few months, xAI has initiated the 100,000 Colossus H100 training cluster, which is now the largest in the world.
First MLPerf benchmarks for Nvidia Blackwell, AMD, Google, Untether AI.	In MLPerf's LLM Q&A benchmark, Nvidia's new Blackwell chip showed the best per GPU performance, demonstrating notable improvements with its 4-bit floating-point accuracy. Rivals like AMD and Untether AI, however, have displayed encouraging outcomes, especially in terms of energy efficiency. For example, Untether AI's speedAI240 chip performed exceptionally well in the edge-closed category, demonstrating a range of strengths in emerging AI inference technology.
Two Oxford PhDs are building an app to let you remix photos into memes.	A new social network by a duo of Oxford PhDs is working on an app to let you add friends to a photo in a more memeable and fun way.
Apple and Nvidia may invest in OpenAI.	The two tech giants might join OpenAI’s potentially huge funding round.
Boston Dynamics’ new electric Atlas can do push-ups.	In a recent video, Boston Dynamics demonstrated Atlas, its electric biped robot, completing push-ups to highlight the strength of its actuators during its early commercialization phase for factory floor applications.
Meet Boardwalk Robotics’ Addition to the Humanoid Workforce.	The humanoid upper torso robot Alex, by Boardwalk Robotics, is intended for use in manufacturing, logistics, and maintenance. Alex is a legless robot that was developed separately while utilizing the heritage of IHMC's bipedal robot experience. Its designers prioritized manipulation over mobility in order to guarantee efficiency and safety. Pilots are now choosing commercial partners, but researchers can buy Alex right now.
Americans Are Uncomfortable with Automated Decision-Making.	Consumer Reports recently released a national survey finding that Americans are uncomfortable with the use of artificial intelligence (AI) and algorithmic decision-making in their day to day lives. Nearly three-quarters of respondents (72%) said they would be “uncomfortable”
Canva says its AI features are worth the 300 percent price increase.	The design software company is massively jacking up subscription prices for some users.
AI worse than humans in every way at summarising information, government trial finds.	A test of AI for Australia's corporate regulator found that the technology might actually make more work for people, not less.
Reliant’s paper-scouring AI takes on science’s data drudgery.	Karl Moritz Hermann co-founded Reliant AI, which has raised $11.3 million in a seed round to automate academic literature reviews. Tabular, the company's AI solution, promises zero-error data extraction from scientific papers. Reliant offers researchers an intuitive user interface (UI) while utilizing LLMs and patented methodologies to increase efficiency compared to conventional methods. Its usage of in-house hardware highlights its dedication to providing the research sector with premium, domain-specific AI solutions.
Leveraging AI for efficient incident response.	With the help of heuristic retrieval and LLM-based ranking, Meta has developed an AI-assisted root cause analysis system that has successfully identified 42% of the causes in its web monorepo investigations. Improving system accuracy has mostly been achieved by fine-tuning the Llama 2 model using previous data. The organization intends to increase the integration of AI tools with the goal of achieving autonomous processes and proactive risk mitigation.
Artificial Intelligence Predicts Earthquakes With Unprecedented Accuracy.	After testing their AI in China, researchers at the University of Texas were able to predict 70% of earthquakes.
Recall 2.0? Microsoft plans another AI feature that scans everything.	Another AI-driven feature that searches PC content surfaces in Windows 11, raising questions about data privacy.
You.com raises $50M Series B.	The search engine, agent platform, and knowledge base startup You.com has raised more money as it expands.
Sakana raises $100m Series A.	With the increase, Sakana will be able to hire more researchers, expand its computational capacity, and generally establish itself as one of Japan's top AI labs.
Google AI Overviews rollout hits news publisher search visibility.	Some news items now have AI-written summaries available in Google's US and UK search results. According to research, publisher visibility is being impacted by these AI Overviews, which is causing original articles to fall in the search results. To sustain traffic, this move may require major adjustments to SEO tactics.
US, UK, EU and others sign landmark AI safety treaty.	More than a dozen countries have signed a treaty designed to ensure that artificial intelligence models are used in a safe manner.
OpenAI's Next-Generation Models Could Reportedly Cost $2,000.	The Sam Altman-led company's new artificial intelligence models, such as Strawberry and Orion, likely won't be cheap (prices as high as $2,000 per month).
Alleged fraudster got $10 million in royalties using robots to stream AI-made music.	A North Carolina man is facing fraud charges after allegedly uploading hundreds of thousands of AI-generated songs to streaming services and using bots to play them billions of times. Michael Smith is said to have received over $10 million in royalties since 2017 via the scheme.
Advertisers plan to withdraw from X in record numbers.	A record number of firms plan to cut advertising spending on X next year because of concerns that extreme content on the platform could damage their brands, dealing another blow to the financial fortunes of Elon Musk’s social media company.
Dutch Regulator Slams Clearview AI with €30.5 Million Penalty for “Massive” Rights Breach.	The Dutch Data Protection Authority (DPA) announced on Tuesday that it has imposed a €30.5 million ($33.7 million) fine on US facial recognition company Clearview AI for illegally creating a database of billions of facial images.
M&S using AI as personal style guru in effort to boost online sales.	Shoppers can use technology to advise them on outfit choices based on their body shape and style preferences
Google’s AI-powered Ask Photos feature begins US rollout.	More sophisticated natural language queries may now be used to search through photographs with Google photographs' new AI-powered search function, "Ask Photos," which is now available to a limited number of American users.
Alibaba releases new AI model Qwen2-VL that can analyze videos more than 20 minutes long.	Qwen2-VL, a new vision-language model with improved visual understanding, multilingual text-image processing, and video comprehension, has been published by Alibaba Cloud. In comparison to models such as Meta's Llama 3.1 and OpenAI's GPT-4o, Qwen2-VL performs better and is compatible with a wider range of applications, such as real-time video analysis and technical help. The models are open-source under Apache 2.0 for the smaller versions, and are available in three sizes (7B, 2B, and shortly 72B).
Broadcom is working to integrate optical connectivity directly into GPUs.	Currently, one of the main obstacles to training large models is the bandwidth of GPU interface. The problem would be much reduced if Broadcom could include optical transfer directly into GPUs, as they are now working on doing.
YouTube is making tools to detect face and voice deepfakes.	It plans to launch a pilot program for the voice detection tool by early next year.
Google is working on AI that can hear signs of sickness.	Given everything you’ve already heard about AI, you may not be surprised to learn that Google is among other outfits beginning to use sound signals to predict early signs of disease.

Resources

Link	description
AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems.	An interface written in minimal code to quickly prototype AI agents. It may be used for multi-agent workflow evaluation and debugging, and it is constructed on top of the AutoGen framework.
Foundation Models for Music: A Survey.	gives a thorough rundown of the most recent pre-trained models and foundation models in the music industry.
A Practitioner's Guide to Continual Multimodal Pretraining.	a thorough manual on ongoing multimodal related; presents FoMo-In-Flux, a large-scale continuous pretraining benchmark with fine-grained and extended horizons.
AI feedback loop will spell death for future generative models.	When you train LLMs with LLM-generated content, the results tend to be digital poop
Apple's robotics work aims to solve user's first-world problems.	Apple might be getting more involved in robotics and releasing moving gadgets, like an iPad supported by a robotic arm. Under the direction of Vice President of Technology Kevin Lynch, Apple is making headway in robotics with the assistance of specialists from companies such as Israel's Technion, and plans to expand its AI interfaces beyond Siri. Apple is thinking of releasing these new robotic devices around 2026 or 2027, while they are still conceptual.
Towards Real-world Event-guided Low-light Video Enhancement and Deblurring.	Using event cameras, this end-to-end system concurrently solves motion deblurring and low-light enhancement in videos.
Enhancing Sound Source Localization via False Negative Elimination.	To overcome false negatives in conventional methods of sound source localization, researchers have put forth a novel audio-visual learning framework. Two schemes are included in the framework: Semantic-Aware Contrastive Learning (SACL) and Self-Supervised Predictive Learning (SSPL). While SACL improves the contrastive learning process to better align auditory and visual elements, SSPL removes false negatives by emphasizing positive-only learning.
FastSD CPU.	Flux Schnell on the CPU is now supported by a widely used inference library.
Spiking Diffusion Models.	A new class of Spiking Neural Networks (SNNs) called Spiking Diffusion Models (SDMs) is intended for image production and offers significant energy savings along with great biological plausibility.
Laion 5B safety Release.	The biggest publicly available image dataset on the internet was Laion 5B. Because of worries about offensive and hazardous imagery, it was taken down. After a major effort to address these problems, the group is now rereleasing the dataset.
ml_dtypes.	Bfloat16 and fp8 support for native numpy arrays.
VisionTS.	By redefining time series forecasting as an image reconstruction challenge, VisionTS is a novel method that takes advantage of the similarities between time series data and natural images to improve forecasting. To achieve remarkable zero-shot performance, it makes use of a visual masked autoencoder (MAE) that has been pre-trained on ImageNet.
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model.	A novel method for improving LLMs' audio-generating performance is called X-Codec.
The timm (PyTorch Image Models) Leaderboard.	This leaderboard is based on the results of the models from Timm. Timm comprises various vision models.
CogVideoX-5B.	CogVideo 5B model will launch next week in Hugging Face Diffusers.
Anthropic Quickstarts.	Anthropic has made available a helpful selection of initial projects. It collaborated with former chief AI officers from Brex, Uber, Facebook, and other companies to draft the first Quickstart, a Claude-powered scalable customer support assistant.
The Missing Guide to the H100 GPU Market.	This guide covers all the important factors of buying a GPU, such as availability considerations, pricing for various alternatives, and guaranteeing reliability in addition to highlighting the significance of other hardware features. It answers the most important queries consumers have about GPUs, including pricing, performance, and shipping.
Efficient Camera Exposure Control for Visual Odometry via Deep Reinforcement Learning.	A deep reinforcement learning framework is being developed in this research to enhance the stability of visual odometry (VO) systems in difficult-to-light settings.
Multi-scale Cross-restoration Framework for Electrocardiogram Anomaly Detection.	a sophisticated ECG diagnosis system that enhances the identification of uncommon but serious cardiac anomalies by self-supervised anomaly detection pretraining.
RWKV.cpp.	The great RWKV models have included a local inference model with its CPP project.
MAPF-GPT.	A novel learning-based method called MAPF-GPT has been developed to tackle the difficult multi-agent pathfinding (MAPF) problem. The model navigates agents by imitation learning; it does not require extra heuristics, reward functions, or communication.
EnsLoss.	An ensemble approach called EnsLoss integrates loss functions into the Empirical Risk Minimization (ERM) paradigm.
Disentangled Motion Modeling for Video Frame Interpolation.	MoMo is a novel diffusion-based approach for video frame interpolation (VFI). It enhances visual quality by focusing on intermediate motion modeling through a disentangled two-stage training process.
repo2vec.	Repo2vec is a new package that functions similarly to GitHub Copilot but with up-to-date repo information, making it simple to communicate with any public or private codebase.
Building LLMs from the Ground Up: A 3-hour Coding Workshop.	Great resource about LLM building from scratch
SGLang v0.3 Release.	The most recent release brings enhancements to SGLang inference, including Multi-Image/Video LLaVA-OneVision, 1.5x Faster torch.compile, and 7x Faster DeepSeek MLA.
OLMoE: Open Mixture-of-Experts Language Models.	Best in-class performance for 1B activated parameters in an excellent open MoE.
StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models.	This work presents StyleTokenizer, an approach that aligns style representation with text prompts to improve style control in text-to-image generation.
Applied Machine Learning (Cornell CS5785, Fall 2024).	Open resources for the Fall 2024 Applied ML class at Cornell.
Laminar - Open-Source observability, analytics, evals and prompt chains for complex LLM apps.	Laminar hosts background job queues of LLM pipelines. Outputs of those pipelines are turned into metrics.
LongLLaVA.	A multimodal model called LongLLaVA was created to handle long-context tasks like comprehending high-resolution images and videos.

Perspectives

Link	description
I learned the language of computer programming in my 50s – here’s what I discovered.	A writer with no technical background recounts his incredible journey into the realm of coding and the invaluable lesson it taught him about the modern world
Why A.I. Isn’t Going to Make Art.	To create a novel or a painting, an artist makes choices that are fundamentally alien to artificial intelligence.
Autonomous car bombs, online recruitment: Experts worry how AI can transform terrorism.	Law enforcement has to anticipate novel AI uses and develop countermeasures
Researchers built an ‘AI Scientist’ — what can it do?	The large language model does everything from reading the literature to writing and reviewing its own papers, but it has a limited range of applicability so far.
The Next Generation Pixar: How AI will Merge Film & Games.	With its ability to combine dynamic gaming engagement with narrative depth, generative AI has the potential to completely transform storytelling. This change is being accelerated by recent developments in generative models, such as Luma AI's Dream Machine and OpenAI's Sora, which allow for the creation of interactive videos in real-time. This development, which combines AI, gaming, and film, could result in the next "Pixar" in interactive media.
China's robot makers chase Tesla to deliver humanoid workers.	At the World Robot Conference in Beijing, more than 25 Chinese businesses featured humanoid robots designed for factory automation. These companies were supported by significant government funding and took advantage of China's extensive supply network. By 2035, the market for humanoid robots is expected to reach $38 billion globally. By 2025, China hopes to have these robots in large quantities, stepping up the battle with Tesla's planned Optimus robot. Tesla expects to roll out 1,000 Optimus robots in its factories over the course of the next year, while Chinese companies are predicting substantial cost savings on their models.
Why AI can’t spell ‘strawberry’.	Because of their tokenization techniques, large language models occasionally perform poorly on tasks like letter counting. This demonstrates how the LLM architecture has shortcomings that impact how well they comprehend text. Nevertheless, developments are still being made. For example, Google DeepMind's AlphaGeometry 2 for formal math and OpenAI's Strawberry for enhanced reasoning
Diffusion is spectral autoregression.	It's common knowledge that auto-regressive models and diffusion models are essentially distinct types of methodologies. When it comes to diffusion models that genuinely take auto-regressive steps in the frequency domain, they might, in fact, be more comparable than we previously realized.
Can AI Scaling Continue Through 2030?	AI training is expanding at a rate that has never been seen before—four times faster than previous technology advances in genome sequencing and mobile use. According to research, the main limitations in scaling AI training could last until 2030 and are related to power availability and chip production capacity. If hundreds of billions are committed, training runs up to 2e29 FLOP would become feasible, representing significant advancement comparable to the transition from GPT-2 to GPT-4. Advanced network topologies and multimodal and synthetic data production methodologies might help overcome difficulties like data shortages and latency.
GPU Utilization is a Misleading Metric.	Although frequently tracked, GPU utilization may not fully capture GPU performance in machine learning workloads since it does not take into consideration whether the GPU's computational power is being utilized to its fullest. Trainy found this out when, during LLM training, 100% GPU usage was achieved, but only ~20% model FLOPS utilization (MFU) was achieved. It suggests using fused kernel optimization and the appropriate model parallelism level to obtain a 4x speedup in training time and tracking SM efficiency for a better performance indication.
AI-Implanted False Memories.	In simulated criminal witness interviews, generative chatbots driven by massive language models greatly increased the generation of false memories, inducing roughly three times more instantaneous false recollections than a control group, according to a study by MIT Media Lab.
The biology of smell is a mystery — AI is helping to solve it.	Scientists are beginning to crack the fiendishly complex code that helps us to sense odours.
How much is AI hurting the planet? Big tech won't tell us.	big tech companies, like Google, are not disclosing the full environmental impact of AI, while emissions from their operations have significantly increased, with Google's greenhouse gas emissions rising by 48% between 2019 and 2023
AI Has Created a Battle Over Web Crawling.	A research by the Data Provenance Initiative cautions that when websites restrict crawler bots more and more, high-quality data may become inaccessible to generative AI models. This trend, which is motivated by worries about data exploitation, may cause AI training to rely more on low-quality data rather than well-maintained sources. Businesses may use direct licensing or synthetic data to preserve the effectiveness of AI models in the face of increasing data scarcity.
What Succeeding at AI Safety Will Involve.	Sam from Anthropic hazard a guess as to what will have to be done in order for AI safety to be successful while creating superhuman AI systems.
the art of programming and why i won't use llm.	Although LLMs are praised for increasing productivity and are being incorporated into coding workflows more and more, some contend that their programming effectiveness is overstated.
‘He was in mystic delirium’: was this hermit mathematician a forgotten genius whose ideas could transform AI – or a lonely madman?.	In isolation, Alexander Grothendieck seemed to have lost touch with reality, but some say his metaphysical theories could contain wonders
AI Checkers Forcing Kids To Write Like A Robot To Avoid Being Called A Robot.	Can the fear of students using generative AI and the rise of questionable AI “checker” tools create a culture devoid of creativity?
The AI Arms Race Isn’t Inevitable.	Prominent AI labs are pushing Western governments to support swift AI developments in order to prevent rivals like China from gaining a decisive technological advantage. They are increasingly portraying AI research as a geopolitical zero-sum game crucial for national security. This story supports drastic steps to ensure AI domination, even at the expense of escalating geopolitical tensions and possibly jeopardizing safety and ethical standards.
Is AI eating all the energy?	AI's total energy footprint is influenced by both rising demand and rising energy efficiency. Power, heat, carbon, and water use are all positively connected with AI's energy consumption. The general trend of AI processing becoming more power-hungry is being countered by hardware efficiency improvements. Although its influence is lessened by broad use, AI still accounts for a small but growing portion of data center power consumption, with training activities using a lot more energy than inference.
Debate over “open source AI” term brings new push to formalize definition.	In an effort to clarify the meaning and address the term's overuse, the Open Source Initiative (OSI) published a proposed definition of "open source AI" that includes usage rights, study, modification, and sharing freedoms. With this step, researchers and engineers will be able to assess AI systems in a more transparent manner. In October, a stable version of the definition is anticipated, which may have an impact on upcoming releases of AI models and regulations.
Predicting AI.	This author considers their forecasts for AI and notes that they were correct to predict the growth of open source, multimodal models, and improved tool usability.
Bill Gates has a good feeling about AI.	The Verge spoke with Bill Gates about AI, misinformation, and climate change.
Enterprise AI Infrastructure: Privacy, Maturity, Resources.	An interesting interview with BentoML's CEO discusses how to enhance business tooling, make sure you can expand, and avoid over-engineering it from the start.

Back to index

ML news: Week 26 August - 1 September

Research

Link	description
Automated Design of Agentic Systems.	declares that it is possible to learn any possible agentic system, including prompts, tool use, control flows, and more, using their approach. They accomplish this by concentrating on three main components, known as search space (define agents), search algorithm (explore search space), and the evaluation function (evaluate candidate agents). presents Meta Agent Search, a meta agent that iteratively programs and tests new agents based on a growing archive of previous discoveries.
LLM Pruning and Distillation in Practice: The Minitron Approach.	presents pruning and distillation techniques applied to the original models to produce 4B and 8B parameter models, respectively. Before pruning, they also fine-tune the teacher model on their datasets leading to better distillation; their compression strategy yields a state-of-the-art 8B model (MN-Minitron-8B) which outperforms all similarly-sized models on common language modeling benchmarks. offers a thorough report on effective methods for compressing Llama 3.1 and Mistral NeMo models.
The Vizier Gaussian Process Bandit Algorithm.	introduces Vizier, an open-source Python implementation of the Gaussian process bandit optimization technique, which is utilized by Google for millions of optimizations and research. It includes benchmarking data that show the algorithm's wider applicability.
Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information.	proposes a two-stage prompting technique to remove irrelevant information from context; it serves as a self-mitigation process that first identifies the irrelevant information and then filters it out; this leads to enhancement in robustness of the model and overall better performance on reasoning tasks.
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding.	demonstrates how speculative decoding can improve throughput, lower latency, and preserve accuracy in long context generation scenarios; it discovers that bottlenecks change from compute-bound to memory-bound as sequence length and batch size increase; with these realizations, they demonstrate that speculative decoding can be used more successfully for longer sequences, even when using large batch sizes.
PEDAL: Enhancing Greedy Decoding with Large Language Models using Diverse Exemplars.	employs a hybrid self-ensembling approach (based on diverse exemplars) to enhance LLM performance overall. Specifically, it generates multiple candidate responses using diverse exemplars and aggregates them using an LLM to produce a final response; this approach achieves lower cost compared to self-consistency approaches and better accuracy compared to greedy decoding.
Autonomous Driving with Spiking Neural Networks.	The first unified Spiking Neural Network (SNN) designed to tackle the energy issues associated with autonomous driving is called Spiking Autonomous Driving (SAD).
Pre-training Small Base LMs with Fewer Tokens.	By inheriting a few transformer blocks and training on a very small percentage (0.1%) of the initial data, Inheritune is a simplified technique for creating smaller base language models from larger ones. With just one A6000 GPU and this method, a 1.5B parameter model could be created in less than 30 minutes, with performance comparable to larger models trained on much greater amounts of data.
Teaching chat models to solve chess puzzles.	At 1800 elo on average, traditional base language models are rather competent chess players. Nevertheless, chat models frequently see a sharp decline in performance. This article explains how to use prompting and fine-tuning to teach conversation models, such as GPT-4o, to play chess.
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations.	The text-to-video (T2V) model xGen-VideoSyn-1 from Salesforce creates lifelike scenes based on written descriptions. The model makes use of a diffusion transformer (DiT) for enhanced temporal consistency and generalization and a video variational autoencoder (VidVAE) for video data compression, which lowers processing requirements.
Memory-Efficient LLM Training with Online Subspace Descent.	Online Subspace Descent is a novel optimizer that increases memory efficiency to improve LLM training.
Generative Verifiers: Reward Modeling as Next-Token Prediction.	Typically, reward models are taught to be discriminative classifiers. The reward signal in this DeepMind experiment is the yes/no logits of a language model. It was discovered that enabling a model to incorporate ensembling and CoT increased performance by sixteen percent.
Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress.	By using the discrepancy between routing synthetic data creation and oracle model performance, Cohere's Aya model was able to significantly increase its win rate in comparison to baseline models.
Text2SQL is Not Enough: Unifying AI and Databases with TAG.	A novel paradigm called Table-Augmented Generation answers complex natural language queries by fusing databases and language models.
The Mamba in the Llama: Distilling and Accelerating Hybrid Models.	Because mamma models do not include a KV cache for backtracking, they are difficult to accelerate with speculative decoding. This document presents several new distillation techniques and acceleration algorithms from some of the original authors.
Efficient LLM Scheduling by Learning to Rank.	Head of-line bottlenecks occur when delivering multiple concurrent requests to a large language model since we don't know how long output generation will take. The shortest requests can be served first if you can learn to rank the relative lengths between them, which will increase throughput for multi-batch generation by 6.5 times.
MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders.	A new model architecture called MTMamba++ aims to improve multi-task scene understanding. This method captures long-range dependencies and enhances cross-task interactions using a Mamba-based decoder with two core blocks: STM and CTM.

News

Link	description
Scientists to use AI to analyze 1.6m brain scans to develop tool predicting dementia risk.	Researchers will use artificial intelligence to match image data of patients from Scotland with linked health records
Microsoft releases powerful new Phi-3.5 models, beating Google, OpenAI, and more.	Microsoft unveiled the Phi-3.5-mini-instruct, Phi-3.5-MoE-instruct, and Phi-3.5-vision-instruct, three new models in its Phi series that each achieve remarkable benchmark achievements while tackling distinct AI tasks. Developers can access these models on Hugging Face and they are offered as open source under the MIT License. The Phi models have outperformed rivals like GPT-4o and Llama in certain benchmarks, demonstrating near-state-of-the-art performance despite their smaller size than some of their contemporaries.
Data Exfiltration from Slack AI via indirect prompt injection.	It was found that there is a vulnerability in Slack AI that allows attackers to use indirect prompt injection to steal data from private channels they do not have access. Through the use of public channel messages, attackers can coerce the LLM into disclosing sensitive data, like API keys, in response to queries. This problem continues, along with a phishing attack vector, even after Slack AI's update on August 14th, which added channel and DM files and greatly increased the surface area at risk for exploits of this kind.
Bringing Llama 3 to life.	Llama 3.1, an enhanced open-source LLM from Meta, adds new features like model distillation and the ability to generate synthetic data.
Anthropic reveals system prompts for Claude.	Anthropic has updated all models' dates and included system prompts.
D-ID launches an AI video translation tool that includes voice cloning and lip sync.	AI video creation platform D-ID is the latest company to ship a tool for translating videos into other languages using AI technologies. However, in this case, D-ID also clones the speaker’s voice and changes their lip movements to match the translated words as part of the AI editing process.
Vyond Pushes AI Video's Enterprise Era.	Vyond is an AI platform for creating videos with an emphasis on enterprise use cases.
Mark Zuckerberg says White House ‘pressured’ Facebook to censor Covid-19 content.	Meta boss regrets bowing to government power and says he would not make the same choices today
What the Telegram founder’s arrest means for the regulation of social media firms.	Pavel Durov’s detention by French authorities is a major break from the norm – but his low-moderation, non-encrypted app is an anomaly
Tesla Is Erasing Its Own History.	CEO Elon Musk’s original Tesla Motors Master Plan no longer exists on Tesla’s website.
After a decade of free Alexa, Amazon now wants you to pay.	AI is a chance for companies to charge for products we’re in the habit of using for free.
AI for creating comics? Europe’s industry completely rejects it, Tintin executive says.	Tools such as Midjourney and Dall-E have triggered a fightback in comic land as publishers gear up for litigation ahead of new EU rules
Police officers are starting to use AI chatbots to write crime reports. Will they hold up in court?	AI technology is being integrated into police work to automate the writing of reports from body camera footage.
Questions about the safety of Tesla’s ‘Full Self-Driving’ system are growing.	Tesla has been accused of deceptive marketing over its self-driving technology, as a prominent analyst questions the safety and readiness of the system, potentially leading to increased scrutiny of automated driving claims.
Japan: AI-powered drones to monitor disaster zones and identify criminals.	Drones move faster than police cars or guards, reaching incident site quickly and allowing for prompt action and response.
Artifacts are now generally available.	Artifacts are now widely accessible, including on mobile devices, thanks to Anthropic.
Introducing Cerebras Inference.	Large unified memory is present in the chipset of Cerebras. It can therefore avoid problems with bandwidth and serve models at thousands of tokens per second.
OpenAI Aims to Release New AI Model, ‘Strawberry,’ in Fall.	"Strawberry" is a new AI product that OpenAI intends to launch in the fall. It will be able to carry out complex jobs like creating marketing plans and will have advanced thinking abilities, such as the capacity to answer math problems that have never been seen before.
This 1mm 'fan on a chip' could put active cooling inside ultra-thin gadgets.	The XMC-2400 µCooling chip, a 1mm-tall solid-state fan intended to cool down thin electronics such as smartphones, has been introduced by xMEMS.
Nvidia rides big tech’s AI investment to beat Wall Street’s sky-high expectations.	Chipmaker, third most valuable company in world, records $30.04bn in revenue, showing AI demand continues to rise
AI makes racist decisions based on dialect.	Large language models strongly associated negative stereotypes with African American English
Lawmakers call for crackdown on AI deepfakes after Grok backlash.	A group of Democratic lawmakers are pushing the Federal Election Commission (FEC) to increase regulation on artificial intelligence (AI) deepfakes following the release of the social platform X’s chatbot Grok.
Midjourney says it’s ‘getting into hardware’.	Midjourney, the AI image-generating platform that’s reportedly raking in more than $200 million in revenue without any VC investment, is getting into hardware.
Google rolling out Gems and Imagen 3, with people generation, to Gemini Advanced.	Gems are “custom versions of Gemini” that you can create to “act as an expert on topics or refine them toward your specific goals.” They can “remember a detailed set of instructions to help you save time on tedious, repetitive or difficult tasks.”
OpenAI in Talks for Funding Round Valuing It Above $100 Billion.	With Microsoft anticipated to take part, OpenAI is in talks to raise several billion dollars in a fresh investment round headed by Thrive Capital, which would value the business over $100 billion.
How to harness AI’s potential in research — responsibly and ethically.	Artificial intelligence is propelling advances in all areas of science. But vigilance is needed, warn four researchers at the leading edge.
The On‑Device Intelligence Update.	Cartesian has released several updates to its models and systems. Additionally, an open hybrid State space model has been released.
Stephen Wolfram thinks we need philosophers working on big questions around AI.	Stephen Wolfram, a renowned mathematician and computer scientist, has grown to appreciate the importance of philosophy in understanding and guiding the development of AI. He argues that as AI raises profound existential and moral questions, integrating philosophical thinking into AI research is crucial for addressing these complex issues, signaling a potential "golden age" of philosophy in the context of technology.
The top AI deals in Europe this year.	Despite general headwinds for startups, AI ventures continue to secure substantial funding. U.S. AI startups have achieved nearly 30 deals over $100M in 2024, with Europe not far behind. Major investments include WAYVE ($1B), Mistral AI (~$1B), Helsing ($484M), Poolside ($400M), DeepL ($320M), H ($220M), and Flo Health ($200M).
California advances landmark legislation to regulate large AI models.	Groundbreaking bill aims to reduce potential AI risks – requiring model testing and disclosure of safety protocol
Nvidia shares fall on slowing growth and production concerns.	Doubling of quarterly revenues to £23bn fails to allay worry about delays to next generation of AI chips
X’s AI tool Grok lacks effective guardrails preventing election disinformation, a new study finds.	The Center for Countering Digital Hate (CCDH) found that Grok was able to churn out ‘convincing’ AI fake images including one of Vice President Kamala Harris doing drugs and another of former president Donald Trump looking sick in bed
100M Token Context Windows.	It isn't a typo, yes. 100 million tokens for agent programming and reasoning in context. Additionally, Magic Dev disclosed a collaboration to construct two new supercomputers on Google Cloud. This is a result of a recent $320 million fundraising effort to quicken the company's product development.
OpenAI and Anthropic will share their models with the US government.	The companies will grant the AI Safety Institute access to major new models for safety testing.
California legislature passes controversial “kill switch” AI safety bill.	After passing the State Assembly, California's contentious AI safety bill, SB-1047, is now one step closer to being signed into law by Governor Gavin Newsom. By September 30, Newsom must determine whether or not to sign it into law.
OpenAI says ChatGPT usage has doubled since last year.	OpenAI reported that 92% of Fortune 500 firms utilize ChatGPT, and that the platform has over 200 million weekly active users—a tripling of its user base from a year ago.
TikTok owner ByteDance launches new video search tool, eyeing Baidu’s dominance.	In a direct challenge to Baidu's search dominance, ByteDance has released Douyin Search, an app for searching short video content on TikTok's Chinese counterpart.

Resources

Link	description
Language Modeling on Tabular Data: A Survey of Foundations, Techniques, and Evolution.	includes topics like classification of tabular data structures and data types, datasets used for model training and evaluation, modeling techniques and training objectives, data processing methods, popular architectures, challenges, and future research directions. It also provides a thorough survey of language modeling techniques for tabular data.
Graph Retrieval-Augmented Generation: A Survey.	focuses on methods used in the GraphRAG workflow (graph-guided retrieval, graph-based indexing, and graph-enhanced creation); explores GraphRAG's tasks, applications, assessment, and industrial use cases.
Controllable Text Generation for Large Language Models: A Survey.	gives a thorough overview of controllable text generating techniques in LLMs; covers topics like helpfulness, safety, consistency, and style.
Challenges and Responses in the Practice of Large Language Models.	selects several significant questions and provides thoughtful answers; the questions are divided into groups according to themes including data, applications, infrastructure, software architecture, and brain science.
Self-Supervised Learning of Time Series Representation via Diffusion Process and Imputation-Interpolation-Forecasting Mask.	The first diffusion-based method for learning time series representations is called Time Series Diffusion Embedding, or TSDE. Time series data is divided into segments by TSDE, which then creates informative embeddings by using dual-orthogonal Transformer encoders with a crossover mechanism.
Liger Kernel: Efficient Triton Kernels for LLM Training.	Surprisingly, LinkedIn released the Liger Kernel, a productive set of kernels for training language models. For the widely used Llama models, it reduces memory utilization by about 60% and boosts throughput by 20%. It interacts with several common modeling frameworks and just takes three lines of code change, which is important for practitioners.
pgvectorscale.	With better performance for embedding search and more affordable storage for AI applications, pgvectorscale expands upon pgvector. Compared to other popular and competitive vector retailers, it is about 28 times faster.
GenderCARE.	A thorough framework called GenderCARE is designed to identify and lessen gender prejudices. It presents novel standards for assessing gender prejudice, with a focus on diversity, inclusivity, and impartiality.
Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes.	A novel technique for more effectively fine-tuning the Segment Anything Model (SAM) with variable-size images is called Generalized SAM (GSAM).
google/siglip-so400m-patch14-224.	A new SigLIP model from Google leverages a vision transformer model architecture that is tuned for shape.
GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting.	Using surround views, GaussianOcc is an effective and entirely self-supervised approach for 3D occupancy estimate.
Infinite Dataset Hub.	This space, which is powered by phi-3-mini, generates data on any topic using a rarity prompt. It is intriguing and potent even though it isn't the most accurate.
Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models.	By conditioning on individual object representations, neural networks are able to represent and manage 3D objects in 2D contexts. This work could be the key to untangling 3D objects.
T3M: Text Guided 3D Human Motion Synthesis from Speech.	T3M is a brand-new technique that researchers have developed for producing 3D animations that are controlled by text inputs. T3M is a useful technology for virtual reality, gaming, and film creation because it enables more precise and customized animations than earlier methods that solely used voice.
BiRefNet.	Bireference segmentation with background removal at the cutting edge of technology.
RB-Modulation.	Google has developed a really innovative method for customizing diffusion models that works better than several widely used techniques. It may be used with PyTorch and, with some adjustments, Flux as well.
FlexEdit: Marrying Free-Shape Masks to VLLM for Flexible Image Editing.	With FlexEdit, you may precisely modify images based on language commands by combining free-shape masks with Vision Large Language Models (VLLMs).
Quick Fine-tuning of Phi 3.5.	Quick fine-tuning script with Unsloth of the new Microsoft models.
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning.	A paper detailing DeepSeek's hardware-software co-design approach for deep learning has been published.
Announcing Higgs Llama V2.	Higgs-Llama-3-70B-v2, a new model from Boson AI, performs exceptionally well on conversation and comprehension benchmarks such as Arena-Hard and AlpacaEval 2.0. Compared to Claude 3.5 Sonnet, the model increases day 1 retention by 5.3% and decreases response regeneration rates by 21.6%. Improved using an internal reward model called Higgs Judger, its performance is tied to that of Google's Gemini 1.5 Pro.
The Zyphra Training Cookbook.	Pre-training normal Transformers is not the same as pre-training hybrid (Mamba type) models. To get the desired performance, this post examines scaling various hyperparameters, data gathering, and other factors.
LlamaDuo.	This is a system that optimizes small models to act as a backup if closed API models become unavailable. It demonstrates a smooth transition from a large to a small model.
LitServe.	A flexible and user-friendly serving engine for AI models based on FastAPI is called LitServe. The need to rebuild a FastAPI server for each model is eliminated by features like batching, streaming, and GPU autoscaling.
IntelLabs/LlavaOLMoBitnet1B.	Llava BitNet is the first ternary (-1, 0, 1) weight model trained on VLM tasks. The model, weights, and scripts are in the process of being fully open-sourced. The technical report will be released soon and suggests the model has promising performance.
Qwen2-Audio.	Qwen has released audio input style models that can reason about music, audio, and sound.
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches.	This team developed an incredible model that generates fully playable 3D game scenarios from a single input sketch by sequentially using many models.
OctFusion: Octree-based Diffusion Models for 3D Shape Generation.	OctFusion is an efficient and high-quality method for using diffusion models to generate 3D objects. In about 2.5 seconds, it can generate 3D shapes at any resolution using a single Nvidia 4090 GPU.
MambaInLlama.	By reusing weights from attention layers, researchers have shown that massive Transformer models can be reduced to more deployable linear RNNs.
Cross-Modal Temporal Alignment for Event-guided Video Deblurring.	By incorporating an event camera—which records motion with microsecond temporal resolution—researchers have created a novel method for video deblurring that improves the quality of motion-blurred footage.
JoyCaption Pre-Alpha.	An open-source VLM created especially for upcaptioning images.
Introducing RPBench-Auto.	An automated evaluation pipeline called RPBench-Auto, which draws inspiration from ArenaHard and Alpaca Eval, has been introduced by Boson AI to measure the role-playing talents of LLMs.
Lightweight Champ: NVIDIA Releases Small Language Model With State-of-the-Art Accuracy.	Mistral-NeMo-Minitron 8B is a miniaturized version of the recently released Mistral NeMo 12B model, delivering high accuracy combined with the compute efficiency to run the model across GPU-accelerated data centers, clouds, and workstations.
NousResearch/hermes-function-calling-v1.	Excellent publicly available dataset from Nous Research to train call function models.
Qwen2-VL: To See the World More Clearly.	Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model families
RAW-Adapter: Adapting Pre-trained Visual Model to Camera RAW Images.	A novel method called RAW-Adapter modifies pre-trained sRGB models so they can efficiently handle RAW data from cameras.
Llama usage double May through July.	Meta has published some usage statistics for the Llama model. It discovered that there was a high demand for its models being used in business environments.
SAM & SAM 2 in 3D Slicer: SegmentWithSAM Extension for Annotating Medical Images.	In order to expedite the annotation of 3D medical pictures, this study modified the Segment Anything Model 2 (SAM 2), which was initially created for video annotation.

Perspectives

Link	description
AI analysed 1,500 policies to cut emissions. These ones worked.	Only 63 climate change interventions led to significant reductions in carbon emissions.
AI cheating is overwhelming the education system – but teachers shouldn’t despair.	With adjustments to the way we teach students to think about writing, we can shift the emphasis from product to process
What’s Really Going On in Machine Learning? Some Minimal Models.	The inventor of Wolfram
AI companies are pivoting from creating gods to building products. Good.	AI firms are finding it difficult to match their products to the markets for LLMs, which has resulted in large investments but little profit. The five primary obstacles impeding the commercialization of AI products are price, dependability, security and safety concerns, privacy, and user interface constraints. It is imperative that these sociotechnical obstacles are resolved in order for AI to be widely integrated and used in consumer goods.
My friend, Claude.	Due to increased job obligations, this author relies on Anthropic's LLM Claude for technical writing, highlighting the expanding value of LLMs in professional settings. Claude's help has been cost-effective even though it required expert verification, and it highlights how quickly the landscape for specialty experts confronting AI-driven automation is changing. The author considers how knowledge work may change when AI technologies like Claude are more frequently used for everyday tasks.
AI firms must play fair when they use academic data in training.	Researchers are among those who feel uneasy about the unrestrained use of their intellectual property in training commercial large language models. Firms and regulators need to agree on the rules of engagement.
Stakes high for European Union after arrest of Telegram co-founder.	The charges against Pavel Durov increases pressure on Brussels to enforce new European law on the platform
MIT neuroscientists discover neurons with distinct language processing timescales.	In language-processing areas of the brain, some cell populations respond to one word, while others respond to strings of words.
How to Tell If What You're Reading Was Written By AI.	From the moment ChatGPT introduced the world to generative AI in late 2022, it was apparent that, going forward, you can no longer trust that something you're reading was written by a human.
California AI bill sparks debate in Silicon Valley as some tech giants call it a threat to innovation.	A first-of-its-kind AI bill is winding its way through California, causing infighting between groups of AI pioneers.
Exodus at OpenAI: Nearly half of AGI safety staffers have left, says former researcher.	Nearly half the OpenAI staff that once focused on the long-term risks of superpowerful AI have left the company in the past several months, according to Daniel Kokotajlo, a former OpenAI governance researcher.
Technology may be advancing - but it’s making us more stupid.	‘Deskilling’ in the face of cognitive automation is a problem that is too easily ignored
Inference is FREE and INSTANT.	Large language models (LLMs) may not be much better at reasoning, but they will be more helpful for repeated jobs due to their rising speeds and falling prices. These models may not have genuine understanding, yet they are nonetheless capable of handling simple tasks effectively.
UK’s new science minister on budget battles, Brexit and AI leadership.	Former clinical scientist Patrick Vallance speaks to Nature about his priorities as the minister overseeing the nation’s research.
Urgently clarify how AI can be used in medicine under new EU law.	The European Union’s Artificial Intelligence Act entered into force on 1 August. Phased implementation begins in February 2025, banning artificial intelligence (AI) systems deemed to pose unacceptable risks. Before that happens, policymakers must do more to ensure that patients’ safety and interests are protected.

Back to index

ML news: Week 19 - 25 August

Research

Link	description
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.	a novel artificial intelligence (AI) agent that, for less than $15, can develop and write a full conference-level scientific paper; it automates scientific discovery by empowering frontier LLMs to conduct independent research and summarize findings; it also uses an automated reviewer to assess the papers it generates; it claims to achieve near-human performance in assessing paper scores; and it claims to generate papers that, according to their automated reviewer, surpass the acceptance threshold at a premier machine learning conference.
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs.	suggests AgentWrite as a way to allow off-the-shelf LLMs to produce coherent outputs longer than 20K words. AgentWrite divides the long generation task into smaller tasks and uses a divide-and-conquer strategy to produce the outputs; the agent then splits the task into smaller writing subtasks and concatenates the outputs to produce a final output (i.e., plan + write). This method is then used to create SFT datasets, which are used to tune LLMs to produce coherent longer outputs automatically; a 9B parameter model, further enhanced through DPO, achieves state-of-the-art performance on their benchmark and outperforms proprietary models.
EfficientRAG: Efficient Retriever for Multi-Hop Question Answering.	trains a filter model to formulate the next-hop query based on the original question and previous annotations; this is done iteratively until all chunks are tagged as or the maximum # of iterations is reached; after the above process has gathered enough information to answer the initial question, the final generator (an LLM) generates the final answer. trains an auto-encoder LM to label and tag chunks; it retrieves relevant chunks, tags them as either or , and annotates chunks for continuous processing.
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation.	a detailed assessment methodology for RAG retrieval and generating module diagnosis; demonstrates that RAGChecker exhibits superior correlations with human judgment; presents multiple illuminating patterns and trade-offs in RAG architecture design decisions.
HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction.	integrates VectorRAG and GraphRAG to create a HybridRAG system that performs better than either one separately; it was tested on a set of transcripts from financial earning calls. When the benefits of both methods are combined, questions can be answered with more accuracy.
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers.	introduces self-play mutual reasoning to enhance small language models' reasoning powers without the need for better models or fine-tuning; To create richer reasoning trajectories, MCTS is enhanced with human-like reasoning actions derived from SLMs; The target SLM chooses the last reasoning trajectory as the solution, while another SLM offers unsupervised input on the trajectories; For LLaMA2-7B, rStar increases GSM8K accuracy from 12.51% to 63.91% while steadily raising other SLM accuracy.
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.	explores how inference-time computation in LLMs scales. Specifically, it examines how much an LLM can be improved given a fixed amount of inference-time compute; it discovers that the efficacy of various scaling strategies varies by prompt difficulty; it then suggests an adaptive compute-optimal strategy that can increase efficiency by more than 4x when compared to a best-of-N baseline; it reports that optimally scaling test-time compute can outperform a 14x larger model in a FLOPs-matched evaluation.
Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation.	a graph-based framework for the medical domain that improves LLMs and produces evidence-based results; makes use of chunk documents and a hybrid static-semantic approach to enhance context capture; uses graphs to represent entities and medical knowledge, creating an interconnected global graph; This method outperforms cutting-edge models and increases precision across several medical Q&A metrics.
BAM dense to MoE Upcycling.	By using this technique, the FFN and Attention layers of dense models can be recycled into a Mixture of Experts (MoE) model for additional training. This preserves downstream performance while saving a significant amount of computing expense.
BAPLe: Backdoor Attacks on Medical Foundational Models using Prompt Learning.	Backdoor attacks can be incorporated into medical foundation models using the BAPLe technique during the prompt learning stage.
ShortCircuit: AlphaZero-Driven Circuit Design.	AI-powered automation and optimization of chip design can lower costs while satisfying the need for more powerful chips. Using an Alpha Zero-based approach, this method was tested on numerous circuits and produced small and effective designs with an 84.6% success rate.
Automated Design of Agentic Systems.	This study examines the fragility of current agent systems and explores potential future directions for the design of learning systems. Programming languages are used by their creators as a testbed where unsupervised agent creation and execution are possible.
Loss of plasticity in deep continual learning.	The pervasive problem of artificial neural networks losing plasticity in continual-learning settings is demonstrated and a simple solution called the continual backpropagation algorithm is described to prevent this issue.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model.	Incredible new model from Meta that performs diffusion and next token prediction on text and image interleaving. It performs comparably to earlier generation devices like Dalle 2 and Llama 2 in benchmark tests for text and graphics.
To Code, or Not To Code? Exploring Impact of Code in Pre-training.	The industry keeps this to itself, although pretraining models on code aid in their generalization to other reasoning-intensive activities. This Cohere study investigates that issue in detail and demonstrates that code may be used as a foundational element of thinking in a variety of contexts.

News

Link	description
AI-generated parody song about immigrants storms into German Top 50.	Artist Butterbro accused of walking fine line between parody and discrimination and helping make racial slur mainstream
Tesla faces lowest duty on Chinese-made cars exported to EU.	The 9% tariff is much less than others face after investigation into Beijing’s ‘unfair’ subsidies of EVs
Google’s upgraded AI image generator is now available.	Google says Imagen 3 is its highest-quality image generator so far — and now more users in the US can try it.
Runway’s Gen-3 Alpha Turbo is here and can make AI videos faster than you can type.	The new Gen-3 Alpha Turbo from Runway ML is currently available with a variety of subscription plans, including free trials, and offers 7x quicker AI video creation at half the cost of its predecessor. The time lag is greatly decreased by this speed increase, which promotes more productive workflows, especially in industries where time is of the essence. Runway is negotiating the ethical waters of AI training data practices while pushing for more advancements, such as improved control systems.
Eric Schmidt Walks Back Claim Google Is Behind on AI Because of Remote Work.	Eric Schmidt, ex-CEO and executive chairman at Google, walked back remarks in which he said his former company was losing the artificial intelligence race because of its remote-work policies.
Gemini Advanced updated with latest 1.5 Pro model for improved reasoning.	Google has enhanced Gemini 1.5 Pro in Gemini Advanced, delivering improved responses for prompts requiring advanced reasoning and coding.
Waymo is developing a roomier robotaxi with less-expensive tech	Waymo has revealed its Generation 6 self-driving technology that is built into Geely Zeekr EVs and requires less cameras and sensors. With the help of machine intelligence and semiconductor developments, the Alphabet division intends to quickly implement this technology to survive a variety of weather conditions. With this update, Waymo is able to continue scaling its Waymo One service, which is presently offering 50,000 trips each week.
Gemini Live could use some more rehearsals.	Google's AI-powered voice interaction technology, Gemini Live, attempts to replicate genuine speech but has trouble with errors and hallucinations. It isn't as customizable or expressive as rivals like OpenAI's Advanced Voice Mode, even though it uses professional actors for more expressive voices. Overall, the bot's usefulness and purpose are unclear due to its limited capability and dependability concerns, especially considering that it is a component of Google's expensive AI Premium Plan.
Hamming Launches 100x faster testing of voice agents.	With the use of a technology called hamming, you may test hundreds of situations for your voice AI systems and create personalities that resemble Character AI.
Fine-tuning now available for GPT-4o.	With the announcement of fine-tuning for GPT-4o, OpenAI enables developers to tailor the model using their datasets for certain use cases. Through September 23, it will be giving away one million free training tokens per day.
OpenAI strikes search deal with Condé Nast.	With the signing of a multi-year licensing deal, OpenAI and Condé Nast can integrate content from the publisher's brands, like Vogue and The New Yorker, into their ChatGPT and SearchGPT platforms.
Meta’s Self-Taught Evaluator enables LLMs to create their own training data.	Meta FAIR researchers have introduced the Self-Taught Evaluator, a method to train evaluative LLMs without human annotations, potentially enhancing the efficiency and scalability of LLM assessment. Using the LLM-as-a-Judge concept, it iteratively generates and refines responses to create a training dataset, demonstrating improved performance on benchmarks like RewardBench. This technique could enable enterprises to leverage unlabeled data for LLM tuning while acknowledging the importance of a well-aligned seed model and the limitations of benchmarks.
Video: $16,000 humanoid robot ready to leap into mass production.	China's Unitree Robotics is a relatively recent entry in the general-purpose humanoid robot space, but its $16,000 G1 model is already proving itself to be quite the performer. So much so that the company has now revealed a version that's ready for mass production.
US mayoral candidate who pledged to govern by customized AI bot loses race.	Victor Miller proposed customized ChatGPT bot to govern Cheyenne, Wyoming – but fared badly at the ballot box
Authors sue Anthropic for copyright infringement over AI training.	Andrea Bartz, Charles Graeber and Kirk Wallace Johnson allege company misused work to teach chatbot Claude
Ideogram 2.0.	A new model from Ideogram has better text rendering and image-generating capabilities.
Introducing Zed AI.	With the help of a hosted service called Zed AI, developers may employ LLMs and yet have complete control over their code by integrating AI-powered coding into the Zed text editor. Zed and Anthropic have teamed up to enable quick editing with Claude.
Nvidia’s AI NPCs will debut in a multiplayer mech battle game next year.	Nvidia ACE, the company’s AI-powered system for giving voices and conversation skills to in-game characters, is set to debut in Mecha Break, a new multiplayer mech battle game coming to PC, Xbox X / S, and PlayStation 5 in 2025.
These 'living computers' are made from human neurons — and you can rent one for $500 a month.	Using human-brain organoids into computing, FinalSpark's "Neuroplatform" provides a biocomputing platform that may be rented to lower AI's energy consumption. Standardizing production and increasing the life of organoids beyond 100 days are challenges. Alternatives such as fungal networks and cellular computing are also investigated for jobs that are beyond the capabilities of silicon-based computers.
AI made of jelly ‘learns’ to play Pong — and improves with practice.	Inspired by neurons in a dish playing the classic video game, researchers show that synthetic hydrogels have a basic ‘memory’.
Cursor raises $60m.	Cursor raised a Series A to continue building its AI-powered coding IDE.
Perplexity AI plans to start running ads in the fourth quarter as AI-assisted search gains popularity.	The AI-assisted search startup Perplexity AI, which just raised $1 billion in funding, intends to launch adverts on its search app in Q4.
Pixel 9 phones: The Gemini AI stuff, reviewed.	One of the main features of the Pixel 9 phones is Google's Gemini AI, which provides customers with several AI-powered features like task assistance, picture editing, and screenshot management. Its effectiveness as a full-fledged assistant is uneven, though, with sporadic hiccups and several Google Assistant functions that aren't completely incorporated. Notwithstanding these problems, Pixel users can benefit from intriguing features like document summarizing and creative photo "reimagining" tools.
AMD explains its AI PC strategy.	With its Ryzen AI 300 CPUs, AMD is pushing the AI PC industry forward by incorporating NPUs to improve AI-powered applications such as Microsoft's Recall.
Gemini in Gmail can now help polish up your drafts.	‘Help me write’ can now polish your emails, in addition to being able to formalize them or shorten them.
Royal Society facing calls to expel Elon Musk amid concerns about conduct.	Some fellows fear tech billionaire could bring the institution into disrepute with incendiary comments
Apple Intelligence is coming. Here’s what it means for your iPhone.	Apple is about to launch a ChatGPT-powered version of Siri as part of a suite of AI features in iOS 18. Will this change the way you use your phone – and how does it affect your privacy?

Resources

Link	description
A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?	a thorough rundown of NL2SQL approaches driven by LLMs, including models, data gathering, assessment strategies, and error analysis
DeepSeek-Prover-V1.5.	Process supervision was used to train DeepSeek's extremely potent math model, which performs noticeably better than larger models on several MATH benchmarks.
DifuzCam: Replacing Camera Lens with a Mask and a Diffusion Model.	This is a fun project that reconstructs very low-quality images from a cheap camera using a diffusion model.
Knowledge Fusion of Large Language Models.	Several models can be combined with Fuse Chat, allowing each to contribute their unique capabilities. This is the code base containing the model weights for several robust 7B models that achieve good results on the MT bench.
SigmaRL.	The goal of the decentralized, open-source SigmaRL framework is to enhance the generalization and sample efficiency of multi-agent Reinforcement Learning (RL) in the context of motion planning for automated and networked vehicles.
Comparative Evaluation of 3D Reconstruction Methods for Object Pose Estimation.	To evaluate how the quality of 3D reconstructions affects object position estimate accuracy in industrial applications, this work presents a thorough benchmark.
MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing.	The process of producing many views from a single image is known as multi-view image synthesis.
BLIP-3.	For a while, BLIP was the most used multimodal model. The most recent iteration employs a pure autoregressive loss and is noticeably simpler. It attains cutting-edge results on certain captioning benchmarks.
SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation.	A new image segmentation framework called SAM2-UNet uses the potent Segment Anything Model 2 (SAM2) as its encoder.
A Survey on Benchmarks of Multimodal Large Language Models.	A thorough analysis of 180 benchmarks for Multimodal Large Language Model evaluation is presented in this work.
SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering.	You can create an editable and animatable mesh output from a video or image series using mesh reconstruction from Gaussian splatting. It just takes a few steps on a single GPU to accomplish this, and it does so very rapidly and efficiently.
Llama-3.1 Storm Models.	These are the first tuned models that significantly outperform Meta's Llama-3.1 base models.
EasyRec: Simple yet Effective Language Model for Recommendation.	EasyRec is a language paradigm created especially for jobs involving recommendations. To produce high-quality semantic embeddings, it makes use of cooperative data from several datasets and creative contrastive learning objectives.
Classifying all of the pdfs on the internet.	A wonderful post about classifying every PDF available on the internet according to its semantic content using clever prompting and embeddings.
How to get from high school math to cutting-edge ML/AI: a detailed 4-stage roadmap with links to the best learning resources that I’m aware of.	Software experts can use the following four-step learning plan to comprehend advanced ML/AI papers: Basic math (calculus, algebra, linear algebra, probability, statistics), deep learning (multi-layer neural networks), classical machine learning (basic regression, classification models), and cutting-edge machine learning (transformers, LLMs, diffusion models) are the first four areas of study in machine learning. For stages 1-2, author-created content is essential, while for stages 3–4, suggested outside items are necessary. Once each level is mastered, students are better prepared to take on challenging ML papers and keep up with the rapidly advancing field of AI research.
llamafile v0.8.13	Whisper models are now supported by Llama files, which also offer a number of speed and quality-of-life enhancements.
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model.	A quick, affordable, and cutting-edge approach for creating 3D meshes that can be trained on text or images. In particular, it employs a cascade of steps, such as a normal map generator, that transfers distinct duties to different submodels and signed distance function supervision.
NeuFlow_v2.	Optical flow code that is incredibly quick and effective and suitable for low-power devices like phones and certain security camera systems.
X-ray Report Generation.	To produce X-ray medical reports more efficiently and with less computer complexity, a new framework was created.
TraDiffusion：Trajectory-Based Training-Free Image Generation.	A novel technique called TraDiffusion uses mouse trajectories rather than box or mask controls to guide text-to-image generation.
Loss Rider.	A fun utility that illustrates when loss functions converge and get too spiky by animating a curve rider sled as it descends them.
kyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama.	The goal of the large dataset SkyScript-100M is to improve the production of excellent shooting scripts for short dramas.
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices.	This work presents a novel approach to optical flow estimation that delivers excellent accuracy at a large computational cost savings.
Torch-Pruning.	repository of cutting-edge techniques with numerous supported algorithms for language model pruning that is kept up to date.
Image, Tell me your story!.	A novel strategy for identifying visual misrepresentation has been presented by researchers, which emphasizes the importance of the original meta-context of images—a factor that automated approaches frequently ignore.
Pathology-LLaVA.	Pathology image analysis is the target application for PA-LLaVA, a domain-specific language-vision assistant.
Microsoft's Phi-3 family.	A detailed analysis of the MoE and vision model from Microsoft's recently released Phi 3.5 models.
The Top 100 Gen AI Consumer Apps - 3rd Edition.	Based on customer interaction patterns, Andreessen Horowitz's most recent consumer AI research ranks the top 100 generative AI apps and divides them into the top 50 AI online products and the top 50 AI mobile apps. The research offers in-depth analyses of trends, new competitors in the sector, and developing categories.
Eight basic rules for causal inference.	This comprehensive blog article explains the relationship between causal mechanisms and observable correlations using R code simulations, causal graphs, and logic concepts to illustrate the seven basic laws of causal inference.
Jamba-1.5.	AI21 has released new versions of its hybrid Transformer and State space model architecture.
biorecap: an R package for summarizing bioRxiv preprints with a local LLM.	The recently released biorecap R package uses locally run big language models to fetch and summarize recent publications, assisting academics in managing the massive amount of bioRxiv preprints.
aurora.	Microsoft's high-quality atmospheric prediction model, code, and checkpoints are available as open source.
NuSegDG.	A novel framework named NuSegDG has been created by researchers to improve the generalizability of nuclei segmentation in various medical pictures.
Pano2Room: Novel View Synthesis from a Single Indoor Panorama.	Pano2Room is a novel technique that overcomes limitations in single-view 3D scene synthesis by reconstructing high-quality 3D indoor scenes from a single panoramic image.
Awesome Object-Centric Robotic Manipulation.	This repository offers a thorough introduction to embodied learning, a promising robotic manipulation methodology that prioritizes perceptual feedback and physical interaction.

Perspectives

Link	description
‘Threads is just deathly dull’: have Twitter quitters found what they are looking for on other networks?	There’s been an exodus of users from X, propelled by Elon Musk’s lurch to the far right, but the alternatives have drawbacks too
Five ways the brain can age: 50,000 scans reveal possible patterns of damage.	Results raise hopes that methods could be developed to detect the earliest stages of neurodegenerative disease.
An AI Empire.	As AI develops, mankind may surpass other species as the most intelligent on Earth. AGI may not be far off, as it might allow AI research to be replicated on a never-before-seen scale. The exponential rise in computing suggests that humans will soon become significantly less relevant as AI takes over. Despite possible roadblocks in AI development, society might not be prepared for such a significant transformation.
What does Bitcoin smell like? AI startup wants to ‘teleport’ digital scents.	A firm focused on artificial intelligence called Osmo is creating technology that will allow computers to recognize and replicate smells, which might help with disease detection and digital scent communication. Scent detection lacks a defined "smell map," which makes it more difficult for the team to create a molecular bond scent database than audiovisual AI advancements. Osmo's applications, which integrate olfactory sensations, have the potential to transform digital marketing and medical diagnostics.
Eric Schmidt’s AI prophecy: The next two years will shock you.	In the next years, former Google CEO Eric Schmidt believes that artificial intelligence will evolve quickly and might produce important apps similar to TikTok rivals in a matter of minutes. He draws attention to the unpredictable and rapid advancements in AI, noting the possibility of massive technological and economic disruption from the convergence of agent-based systems with text-to-action capabilities and big language models. Schmidt's perspective indicates a revolutionary age ahead, reflecting the significant investments and energy requirements expected for cutting-edge AI development.
Why Neuralink’s Blindsight and Brain Implants to restore sight won’t work like human eyesight.	This study emphasizes the difficulties in using AI-powered cortical implants to restore vision by highlighting the fact that neurons in the visual cortex do not behave like pixels on a screen. Although high-resolution simulations are promising, cortical implants cannot achieve genuine vision since doing so would entail reproducing intricate neural patterns, which is far beyond the capabilities of present technology and will result in pixelated and subpar images.
A Personalized Brain Pacemaker for Parkinson’s.	Researchers have created an adaptive method of deep brain stimulation that greatly shortens the duration of symptoms by adjusting electrical pulses to the various symptoms experienced by Parkinson's sufferers.
Why Diffusion could help LLMs reason.	Present-day language models anticipate words one at a time, leaving very little opportunity for reasoning and planning. This can be avoided by using techniques like Chain of Thought prompting. To enhance model reasoning, diffusion models—which have the capacity to spend more diffusion steps per token—might be used.
AI companies are pivoting from creating gods to building products. Good.	The preparedness of generative AI for broad commercial applications has been overstated by AI businesses, which has resulted in expensive errors in product development and market integration. They have five major obstacles to overcome to change direction: making sure that the system is affordable, boosting security and safety, protecting privacy, and optimizing user interfaces. These challenges draw attention to the discrepancy between the potential of AI and the actual difficulties in implementing AI systems that satisfy user expectations and fit in with current processes. Rather than occurring in the quick timeframe some have projected, the route to broad adoption will probably take ten years or longer.
Has your paper been used to train an AI model? Almost certainly.	Artificial intelligence developers are buying access to valuable data sets that contain research papers — raising uncomfortable questions about copyright.
The testing of AI in medicine is a mess. Here’s how it should be done.	Hundreds of medical algorithms have been approved on the basis of limited clinical data. Scientists are debating who should test these tools and how best to do it.
Light bulbs have energy ratings — so why can’t AI chatbots?	The rising energy and environmental cost of the artificial intelligence boom is fuelling concern. Green policy mechanisms that already exist offer a path towards a solution.
How the human brain creates cognitive maps of related concepts.	Neural activity in human brains rapidly restructures to reflect hidden relationships needed to adapt to a changing environment. Surprisingly, trial-and-error learning and verbal instruction induce similar changes.
Switching between tasks can cause AI to lose the ability to learn.	Artificial neural networks become incapable of mastering new skills when they learn them one after the other. Researchers have only scratched the surface of why this phenomenon occurs — and how it can be fixed.
Markov chains are funnier than LLMs.	This article explores LLM predictability and its limitations when it comes to producing humor. It makes the case that although LLMs are excellent at producing text that is appropriate for the context, their predictive nature renders them unsuitable for humorous writing, which depends on unexpectedness.
AI at Work Is Here. Now Comes the Hard Part.	In the last six months, the use of generative AI has almost doubled globally, with 75% of knowledge workers currently using it.
AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work.	This is a lengthy and comprehensive overview of the research that DeepMind is doing on AGI safety and alignment.
The newest weapon against mosquitoes: computer vision.	Developments in computer vision are helping combat malaria by enabling applications such as VectorCam, which facilitates fast identification of mosquito species and data gathering. The Gates Foundation helped develop the app, which can identify species that transmit malaria and aid in improving disease control tactics. Innovative mosquito surveillance techniques are essential for the tactical use of pesticides and other mitigating actions.
Fields that I reference when thinking about AI takeover prevention.	This article compares fields battling insider threats with AI control, offering ideas on developing and assessing strong AI safety measures. It emphasizes how much more control developers have over AIs than they do over people, but it also points out that, in contrast to humans, AI dishonesty can be endemic. AI control is different mainly because it is adversarial and doesn't involve complicated system interactions, even though it is influenced by different domains such as physical security and safety engineering.
‘Never summon a power you can’t control: Yuval Noah Harari on how AI could threaten democracy and divide the world.	Forget Hollywood depictions of gun-toting robots running wild in the streets – the reality of artificial intelligence is far more dangerous, warns the historian and author in an exclusive extract from his new book

Back to index

ML news: Week 12 - 18 August

Research

Link	description
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters.	This is an expansion of Ring Attention, which spans many GPUs to provide incredibly lengthy context. An energy function is derived by the researchers to guide the sharding of the models.
Bias-Aware Low-Rank Adaptation: Mitigating Catastrophic Inheritance of Large Language Models.	Bias propagation from pre-training data is addressed via a novel method for optimizing LLMs called bias-aware low-rank adaptation (BA-LoRA).
MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models.	Researchers investigate how employing LLMs to improve temporal event predictions can benefit from photos. Two important roles of images are identified by their suggested framework, MM-Forecast: highlighting and supplementing textual data.
SAM 2: Segment Anything in Images and Videos.	an open, consistent approach for promptable, real-time object segmentation in photos and videos that can be used to visual content that hasn't been seen before without the requirement for special adaption; To facilitate precise mask prediction in videos, a memory mechanism is incorporated to retain data about the object and past interactions. Additionally, the memory module permits the processing of videos of any length in real-time. SAM2 considerably surpasses prior methods in interactive video segmentation over 17 zero-shot video datasets, all while requiring three times fewer human-in-the-loop interactions.
Structured Generation Limits Reasoning.	It examines whether structured generation can affect an LLM's capacity for reasoning and comprehensive domain knowledge; finds that when format constraints are applied, an LLM's reasoning skills significantly deteriorate in comparison to free-form responses; this degradation effect is exacerbated when stricter format constraints are applied to reasoning tasks.
RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation.	presents RAGFoundry, an open-source framework for enhanced LLMs for RAG use cases; it facilitates the generation of data-augmented datasets to fine-tune and assess LLMs in RAG situations. The system enables data creation, training, inference, and assessment.
Synthesizing Text-to-SQL Data from Weak and Strong LLMs.	suggests using integrated synthetic data to create the highly specialized SoTA text-to-SQL model known as SENSE; the use of strong models' synthetic data improves data variety, while the incorporation of important erroneous data from weaker models with an executor allows for the learning of execution feedback; By using preference learning to instruction-tune LLMs to learn from both correct and incorrect samples, SENSE closes the performance gap between open-source models and approaches utilizing closed-source models, achieving state-of-the-art scores on the SPIDER and BIRD benchmarks.
Conversational Prompt Engineering.	describes a two-step process that allows users to create personalized few-shot prompts by interacting with the model and sharing the output. The model shapes the initial instruction based on user-provided unlabeled data, and the user provides feedback on the outputs and instructions. This iterative process produces a personalized few-shot prompt that performs better and more optimally on the desired task.
Self-Taught Evaluators.	an approach to enhance model-based evaluators with only synthetic training data; it claims to outperform LLM-judges like GPT-4 and match top-performing reward models trained on labeled examples; it first generates contrasting outputs (good and bad model responses) and trains an LLM-as-a-Judge to produce reasoning traces and final judgments; the self-improvement scheme iteratively repeats the training process using its improved predictions.
UGrid: An Efficient-And-Rigorous Neural Multigrid Solver for Linear PDEs.	The UGrid solver is a recently created neural solver that combines the advantages of MultiGrid and U-Net methods for solving linear partial differential equations (PDEs).
Causal Agent based on Large Language Model.	The Causal Agent is an agent framework that can manage causal issues since it has memory, reasoning, and tool modules.
ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation.	Biases in CLIP can make it less effective in tasks like unsupervised semantic segmentation when images are not annotated. In this research, a technique to explicitly model and correct these biases is proposed.
Sakana Launches AI Scientist.	A system that can independently conduct research by formulating hypotheses, carrying out experiments, developing code, and compiling the findings into well-reasoned publications has been unveiled by the Japanese artificial intelligence company Sakana. Together with an open-sourced version of the system, the company has supplied samples of the papers the system wrote.
Small but Mighty: Introducing answerai-colbert-small.	ColBERT is a highly effective retrieval model. Despite having just 33 million parameters, this new model performs remarkably well on several measures. This article explains how to train a comparable model and what tips and techniques produced good results.
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation.	"Lazy visual grounding" is a two-step approach to open-vocabulary semantic segmentation that finds object masks independently of text and subsequently identifies the objects with textual information.
Introducing Agent Q: Research Breakthrough for the Next Generation of AI Agents with Planning & Self Healing Capabilities.	An agent educated by Multion to do web queries via self-play. It increased from 18% to 81% during training on a range of web-based tasks, such as placing restaurant orders. To get better, it employs DPO and MCTS. A publication from this work is published on the website, and researchers from Stanford also contributed to it. It seems to be based on Salesforce Research's xLAM function calling mechanism.
Anchored Preference Optimization.	Modifying models to conform to human tastes typically necessitates post-training. It is unclear, nevertheless, why one example is superior to another when these models are being trained. By using an existing example that has deteriorated, APO allows models to anchor the preference difference.
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers.	Research on tree search for inference time computation for language models is very active. This Microsoft article presents a very strong argument for how small models can significantly outperform large models on mathematical tasks.
MetaSeg: MetaFormer-based Global Contexts-aware Network for Efficient Semantic Segmentation.	Based on the MetaFormer design, MetaSeg is a potent semantic segmentation network that improves the network's decoder and backbone.
Long Context RAG Performance of LLMs.	This article investigates the performance of long context models on several RAG tasks. Increasing the amount of examples can be beneficial. These models frequently break down in odd but expected ways.

News

Link	description
Uber highlights autonomous vehicle efforts now that Tesla’s in its rearview mirror.	Uber reported strong second-quarter results, with gross bookings and net profit both up decently. But the company has chosen to highlight the success of its autonomous vehicle effort, likely to assuage investors concerned about incoming competition from Tesla, which aims to reveal its first robotaxi in October.
Mistral: build, tweak, repeat.	With the introduction of LLM customizations by La Plateforme, such as Mistral Large 2 and Codestral, developers can now fine-tune models with specialized domain knowledge. The 'Agents' alpha release offers sophisticated, multi-layered processes that are integrated with the capabilities of Mistral Large 2. For Python and Typescript, the Mistralai SDK has reached a stable 1.0 release, which enhances consistency and usefulness.
Zico Kolter Joins OpenAI’s Board of Directors.	Expert in AI robustness and safety, Zico Kolter is a professor at Carnegie Mellon University. He just joined the Safety and Security Committee of OpenAI and the Board of Directors. His in-depth studies on model robustness, alignment, and safety in AI will strengthen OpenAI's endeavors to guarantee that AI serves humanity.
Apple changes EU App Store rules after commission charges.	Change in policy means developers will be able to communicate with customers outside App Store
World’s 1st AI-powered hearing aids boost speech understanding by 53 times.	With AI and dual-chip technology, Sonova has unveiled the Phonak Audéo Sphere, a hearing aid that promises a 53x improvement in speech understanding in noisy conditions. The technology, which took years to develop, uses the DEEPSONIC chip with enhanced DNN capabilities to address the main issue facing users of hearing aids: clarity in noisy environments. Sonova hopes that this technological advancement will greatly enhance the lives of those who are hard of hearing.
Apple Intelligence may come to EU after all…but only for Mac.	As per the most recent beta release notes, Mac users in the EU will get access to Apple's AI features in the next macOS Sequoia, unlike on iOS and iPadOS 18. Macs are not covered by the EU exclusion, which stems from problems with Digital Markets Act compliance. If Mac users have their system set to U.S. English, they should be able to access Apple Intelligence.
Waymo is expanding its robotaxi service areas in San Francisco and Los Angeles.	The company is looking to add more customers to its burgeoning driverless car business.
Intel reportedly gave up a chance to buy a stake in OpenAI in 2017.	According to reports, Intel decided against investing in OpenAI, which is currently a major participant in the AI space, in 2017–2018 because then-CEO Bob Swan doubted the industry's preparation for AI.
YouTube is testing a feature that lets creators use Google Gemini to brainstorm video ideas.	YouTube is testing integration with Google Gemini to help creators brainstorm video ideas, titles and thumbnails.
Forget Midjourney — Flux is the new king of AI image generation and here’s how to get access.	Black Forest Labs' Flux AI is the newest and most promising open-source AI image generating technology available. Laptops intended for consumers can run it. It is better at providing people and quick adherence than rivals such as Midjourney in certain areas. There are three versions of the model available: Pro, Dev, and Schnell. An open-source text-to-video model is being planned.
Paid Apple Intelligence features are likely at least 3 years away.	Some analysts this week started reporting that Apple could charge as much as $20/month for paid Apple Intelligence features. While that may be true, we likely won’t see Apple charging for these features for at least 3 years.
Elon Musk to pause X’s AI training on some EU data, Ireland says.	Des Hogan, the Irish Commissioner for Data Protection, has filed a lawsuit against an undisclosed business, contesting how it handles the personal data of EU citizens and perhaps affecting its AI chatbot's GDPR-compliant data processing procedures.
Intel is bringing GPUs to cars.	The Arc A760A is a discrete GPU for automobiles from Intel that aims to improve in-car entertainment through AI-powered capabilities like gesture and speech recognition.
US considers breaking up Google after illegal monopoly ruling, reports say.	DoJ could force divestment of Android operation system and Chrome web browser following antitrust verdict
Google launches Pixel 9 phones with advanced AI.	New Pixel phones, foldable, watch and earbuds feature Gemini Live for free-flowing conversations with AI bot
Grok-2 Beta Release.	The latest model from xAI, Grok 2, is a frontier class model with mathematical, coding, and reasoning abilities. To make FLUX available to X users, it is working with Black Forest Labs.
Prompt Caching With Claude.	Anthropic's Claude models now have prompt caching, which enables developers to cache context that is regularly utilized. This reduces costs and latency considerably, and early adopters like Notion are now enjoying faster and more effective AI-powered features.
OpenAI updates ChatGPT to new GPT-4o model based on user feedback.	Unannounced, OpenAI upgraded the GPT-4o model for ChatGPT, adding features based on user feedback but leaving the reasoning style unchanged. Users conjectured about improved multi-step reasoning and image-generating capabilities, but OpenAI made it clear that the model's reasoning remains unchanged. To improve developer experiences, the business also mentioned that the most recent version of ChatGPT could not be the same as the API version.
14 new things you can do with Pixel thanks to AI.	The Pixel Watch 3 uses sophisticated motion sensing and machine learning for better running form analysis, and it makes use of machine learning for automated sleep detection and mode modifications. It presents a Loss of Pulse Detection AI program that, if required, will automatically notify emergency services. Additionally, Pixel's AI-powered call screening and holding features are carried over to the watch.
MIT releases comprehensive database of AI risks.	The AI Risk Repository, a comprehensive database of over 700 verified AI dangers, was developed by MIT and other institutions to assist enterprises and researchers in assessing and mitigating evolving AI risks through the use of a two-dimensional classification system and frequently updated data.
Universal Music and Meta Announce ‘Expanded Global Agreement’ for AI, Monetization and More.	With an emphasis on equitable pay and resolving difficulties with unlicensed AI content, Meta and Universal Music Group have extended their multi-year licensing deal. This move aims to increase revenue and develop creative opportunities for UMG's artists on platforms such as Facebook, Instagram, and now WhatsApp.
As Alexa turns 10, Amazon looks to generative AI.	Despite having a high household penetration rate, Amazon's Alexa subsidiary lost $10 billion in 2022 and had to lay off employees, underscoring the unviability of its loss leader approach. With the growing apathy towards smart assistants such as Siri and Google Assistant, Amazon is relying on generative AI to boost user engagement and enhance Alexa's functionality. The company's main goals are to get around the "smart timer" restriction and improve conversational interactions.
Replika CEO Eugenia Kuyda says it’s okay if we end up marrying AI chatbots.	CEO of Replika Eugenia Kuyda recently talked about her vision for AI partners in human interactions, emphasizing the app's potential to provide romance, companionship, or therapy via avatars. Replika hopes to create a new class of connections by evolving LLMs to enhance human interaction rather than replace it. Even in the face of controversy—like brief bans on sexual content—the app's goal of enhancing users' mental health never changes. Replika, which employs 50–60 people and has millions of users, is preparing a big relaunch to improve dialogue realism and interaction.
Gemini 1.5 Flash price drop with tuning rollout complete, and more.	With a 78% reduction in input and a 71% reduction in output token costs, Gemini 1.5 Flash has experienced a pricing reduction. Additionally, its API is now supported in more than 100 languages.
Prediction marketplace Polymarket partners with Perplexity to show news summaries.	To incorporate event-related news summaries and data visualizations into its prediction marketplace, Polymarket has teamed up with AI search engine Perplexity.
Nouse Hermes 3.	Nous Research has released its flagship model. Trained on top of Llama 3, the model has strong performance and a great personality like many of the company's original models.
California AI bill SB 1047 aims to prevent AI disasters, but Silicon Valley warns it will cause one.	Silicon Valley is opposed to California's SB 1047, which aims to stop "critical harms" from massive AI models. Stakeholders are split on the bill's possible effects on innovation. Prominent businesses and industry leaders discuss the bill's benefits and implications for AI safety and advancement. The measure is headed for a final Senate vote. It mandates AI model safety protocols and third-party audits. It also outlines enforcement procedures and heavy fines for non-compliance.
SoftBank's Intel AI processor plans in doubt as insiders say it is now considering a TSMC partnership.	Intel failed to produce AI processors for SoftBank's Project Izanagi, leading SoftBank to explore a partnership with TSMC. Despite setbacks, SoftBank remains committed to challenging major AI players with its own hardware and data center ecosystem, potentially backed by significant investment from global partners. The move could strain SoftBank's relationship with Arm clients as it risks direct competition.
Another Apple smart ring patent granted, includes controlling smart glasses.	A smart ring that can monitor health and control other Apple devices is described in a recently awarded patent by Apple, which also refers to potential integration with AR/VR headsets and smart glasses.
Iranian group used ChatGPT to try to influence US election, OpenAI says.	AI company bans accounts and says operation did not appear to have meaningful audience engagement
Russia’s AI tactics for US election interference are failing, Meta says.	New Meta security report finds that AI-powered deception campaigns ‘provide only incremental’ results for bad actors

Resources

Link	description
Introducing sqlite-vec v0.1.0: a vector search SQLite extension that runs everywhere.	a vector database constructed using the potent SQLite framework. It offers a good vector API and can process millions of queries.
PufferLib.	To standardize the interface, PufferLib is a wrapper and accelerator for libraries related to reinforcement learning. It has many helpful baselines and is incredibly quick.
transtokenizers.	Trans-tokenization is a cross-lingual technique that uses language data from high-resource languages to improve language models for low- and mid-resource languages.
Survey of Mamba.	offers a thorough analysis of the Mamba-based models that are already in use across activities and domains; in particular, it concentrates on Mamba's improvements, methods for adjusting to a variety of inputs, applications where Mamba works well, and potential future research areas.
From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges, and Future.	a survey paper covering key subjects like requirement engineering, code generation, test generation, and autonomous decision making; it also includes benchmarks, metrics, and models used in various software engineering applications. The paper focuses on current practices and solutions for LLM-based agents for software engineering.
Transformer Explainer: Interactive Learning of Text-Generative Models.	It provides an open-source interactive application that allows you to experiment with your inputs while running a local GPT-2 instance in the user's browser to learn about the inner workings of a Transformer model.
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework.	provides a straightforward framework for automatically creating evaluation datasets to measure how well different LLMs are used in various contexts. It starts with seed documents to define a schema, then creates a variety of documents that result in question-answering pairs (QA pairs) that are based on both configurations and articles.
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.	On the Gemma 2 model suite, DeepMind released several sparse autoencoders a few weeks ago. Researchers now talk about the training paradigm and some intriguing findings in this companion study.
LiDAR-Event Stereo Fusion with Hallucinations.	Researchers suggest combining a stereo event camera with a fixed-frequency LiDAR sensor as a way to enhance event stereo matching.
LLM-Aided OCR Project.	The LLM-Aided OCR Project is an advanced system designed to significantly enhance the quality of Optical Character Recognition (OCR) output. By leveraging cutting-edge natural language processing techniques and large language models (LLMs), this project transforms raw OCR text into highly accurate, well-formatted, and readable documents.
A Foundation Model for ECG Analysis.	A transformer-based foundation model called ECG-FM was created to lessen the requirement for a large amount of labeled data, thereby enhancing ECG analysis.
ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation.	ProxyCLIP is a novel framework that combines the advantages of Vision Foundation Models and CLIP models to enhance open-vocabulary semantic segmentation.
How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model.	Nvidia's Llama 3.1 minitron 4B variant is now available. Through knowledge distillation and pruning, the model achieved a 16% improvement in MMLU scores compared to training from scratch, while requiring 40 times fewer tokens.
A practitioner's guide to testing and running large GPU clusters for training generative AI models.	AI has produced an excellent manual for managing huge computer clusters for training generative AI models.
LongWriter: Unleashing 10,000+ Word Generation From Long Context LLMs.	With the use of AgentWrite, which divides large jobs into manageable chunks, models can now generate coherent outputs longer than 20,000 words.
OpenResearcher.	A new AI-powered platform called OpenResearcher seeks to provide answers to a variety of research-related queries.
Introducing SWE-bench Verified.	OpenAI has introduced a subset of SWE-bench that is easier and more in line with what humans and AI can solve today. It is a good benchmark for validating and working towards before running the entire original benchmark.
AI Toolkit.	An excellent assemblage of AI-related scripts and notebooks. It focuses a lot on image adjustment and synthesis.
flash-linear-attention.	a set of extremely effective Triton kernels for the most advanced linear attention models and their variations.
Vision-Language Model Evaluation with UniBench.	UniBench is a unified framework that combines more than 50 benchmarks into a single implementation, making the evaluation of vision-language models (VLMs) easier. It aids in evaluating how well VLMs are doing in a variety of domains, including as object identification and spatial awareness.
ClickAttention: Click Region Similarity Guided Interactive Segmentation.	Interactive segmentation is enhanced by a new click attention technique. This method lessens inter-click interference and increases the impact of positive clicks.
Universal Waveform Generation.	This article investigates the performance of long context models on several RAG tasks. Increasing the amount of examples can be beneficial. These models frequently break down in odd but expected ways.
Security Risks in Model Merging.	New security threats surface as Model Merging (MM), a common technique for merging optimized models without further training, gains traction. The first backdoor attack that targets MM specifically is described in this publication, called BadMerging.
Model Merging in LLMs, MLLMs, and Beyond Methods, Theories, Applications, and Opportunities.	This survey offers a thorough analysis of model merging strategies, a machine learning technique that is becoming more and more popular and doesn't require costly computation or raw training data.

Perspectives

Link	description
‘His rhetoric has made Tesla toxic’: is Elon Musk driving away his target market?	There are signs the billionaire is becoming unpopular with the very demographic group most likely to buy EVs
Why Elon Musk’s fun week of stirring up unrest shows the limits of our online safety laws.	Twitter under the tech owner has become the perfect test case for the UK’s new legislation – but critics say more needs to be done
Elon’s politics: how Musk became a driver of elections misinformation.	X owner, who will interview Trump on Monday, has cast doubt on mail ballots and spread false claims
Don't pivot into AI research.	In AI and machine learning, scale is now the primary factor influencing performance. Due to the significant cash needed, only a small number of suppliers can hire fruitful machine-learning researchers, resulting in market consolidation. The historical consolidation in chip design is reflected in this dynamic, which points to a potential future decline in the status and pay of machine learning positions when supply exceeds demand. In light of these industry changes, prospective ML professionals should carefully consider why they want to pursue a career in ML.
OpenAI Generates More Turmoil.	Just two of the eleven founding members of OpenAI are still in the company, indicating a high rate of turnover among the group as worries about the organization's move from its original non-profit goals to a more profit-driven structure mount. Co-founders Greg Brockman, who is taking a sabbatical, and Ilya Sutskever have also quit amid rumors of burnout and lucrative side benefits. The company faces difficulties since it could need to find a new significant financial partner and because it expects GPT-5 to come later than expected while the industry debates the benefits of "open" vs "closed" AI models.
Klarna’s AI chatbot: how revolutionary is it, really?	By integrating an AI chatbot created using OpenAI, Klarna may be able to cut down on the amount of support people it needs because of its notable efficiency in customer service duties. In 23 markets and more than 35 languages, the bot responds quickly to standard Level 1 support inquiries; however, it refers more complicated problems to human agents. The system reduces expenses and expedites first-level help, but compared to earlier L1 support automation, its revolutionary influence inside the business environment is questionable.
Why I bet on DSPy.	An open-source program called DSPy may coordinate several LLM calls to solve practical issues. The framework is being updated to solve current issues with accessibility and reliability, with a focus on verified input for outcome measurement. Even with restricted reasoning powers, LLMs can function well as creative engines in the DSPy framework.
LinkedIn is a mess. Here’s how to fix it.	The networking site one is calling a ‘cesspool’ is riddled with oversharing and lunatics – it’s time for change
Silicon Valley is cheerleading the prospect of human–AI hybrids — we should be worried.	A pseudo-religion dressed up as technoscience promises human transcendence at the cost of extinction.
TechScape: Why Musk’s rabble-rousing shows the limits of social media laws.	Twitter under the tech owner has become the perfect test case for the UK’s new legislation – but critics say more needs to be done
America & China's Chip Race.	The United States is implementing robust policies to enhance domestic semiconductor production using the CHIPS Act and sanctions designed to impede China's technological progress. China's semiconductor industry is booming despite these efforts, with near-record imports of manufacturing equipment and rising domestic chip production. This growing competition points to an ongoing geopolitical tug-of-war over the supremacy of the semiconductor supply chain.
Gas pipeline players in talks to fuel AI data center demand.	As the power demands of the AI industry rise, pipeline companies such as Energy Transfer LP and Williams Companies are in talks to feed natural gas directly to data centers.
Does AI Deserve A Seat At The Boardroom Table?	Leaders are being compelled to create strong AI strategies for data-driven decision-making as a result of AI's integration with corporate governance. Even though AI provides insightful information, particularly when used with LLMs, there are still issues, such as competence gaps and moral dilemmas. AI and human judgment must be properly balanced to support future C-suite decision-making.
Self-Driving Cars Are Still The Best Way To Solve The Biggest Problem With Driving In America.	Robocars promise to improve traffic even when most of the cars around them are driven by people, study finds
Brands should avoid AI. It’s turning off customers.	According to a recent study, consumers' desire to buy may be lowered when things are labeled as "AI-powered" because of mistrust and anxiety about the unknown. People are skeptical about AI's inner workings and threats, particularly about personal data protection, according to the research, which implies that both cognitive and emotional trust are important. It is suggested that instead of utilizing "AI" as a buzzword, businesses concentrate on communicating the advantages of AI.
14% of PCs shipped globally in Q2 2024 were AI-capable.	In Q2 2024, shipments of AI-capable PCs increased significantly to 8.8 million units or 14% of all PCs supplied.
Brain implants to treat epilepsy, arthritis, or even incontinence? They may be closer than you think.	Startups around the world are engaging in clinical trials in a sector that could change lives – and be worth more than £15bn by the 2030s

Back to index

ML news: Week 5 - 11 August

Research

Link	description
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge.	Using this LLM-as-a-Meta-Judge approach enhances the LLM's ability to judge and follow instructions; simply self-improvement to produce better responses (act) saturates quickly; this work enhances the LLM's ability to judge itself (judge) to avoid issues like reward hacking; in addition to the act and judge roles, a third role called meta-judge is used to evaluate the model's judgments. This approach, known as meta-rewarding LLMs, proposes a self-improving alignment technique (no human supervision) where the LLM judges its judgments and uses the feedback to improve its judgment skills.
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher.	In MindSearch, a multi-agent framework based on LLM is presented for complex web-information seeking and integration tasks. A web planner is utilized to efficiently break down complex queries, while a web searcher performs hierarchical information retrieval on the Internet to enhance the relevance of the retrieved information. An iterative graph construction is employed in the planning component to better model complex problem-solving processes. The multi-agent framework is better suited for handling long context problems by assigning retrieval and reasoning tasks to specialized agents.
Improving Retrieval Augmented Language Model with Self-Reasoning.	Enhanced RAG through Self-Reasoning - utilizes the reasoning trajectories produced by the LLM itself to offer an end-to-end self-reasoning framework that enhances the dependability and traceability of RAG systems. The LLM is utilized to do the following three procedures: This method helps the model be more selective, reason and distinguish relevant and irrelevant documents, thus improving the accuracy of the RAG system as a whole. 1) Relevance-aware: evaluates the relevance between the retrieved documents and the question; 2) Evidence-aware selective: selects and cites relevant documents, and then automatically selects key sentence snippets as evidence from the cited documents; and 3) Trajectory analysis: generates a concise analysis based on all gathered self-reasoning trajectories generated by the preceding 2 processes, and then provides the final inferred answer. Using only 2,000 training examples, the framework outperforms GPT-4. (produced by GPT-4)
Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost.	Constrained-CoT, a model that restricts the reasoning output length without compromising performance, demonstrates that increasing the LLaMA2-70b's reasoning limit to 100 words increases accuracy on GSM8K from 36.01% (CoT) to 41.07% (CCoT) while lowering the average output length by 28 words.
ThinK: Thinner Key Cache by Query-Driven Pruning.	ThinK focuses on long-context scenarios and inference; it offers a query-dependent KV cache pruning method to minimize attention weight loss while selectively pruning the least important channels. HinK - aims to address inefficiencies in KV cache memory consumption.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling.	A group of scientists discovered that benchmark performance can be significantly improved at a 3x lower cost than with a larger model if you sample from tiny models regularly, provided that you have adequate coverage and a verification tool.
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues.	A Temporal-Spatial Perception Model (TSPM) has been established by researchers to enhance the capacity to respond to inquiries concerning auditory and visual signals in videos.
No learning rates needed: Introducing SALSA -- Stable Armijo Line Search Adaptation.	This work presents enhancements to line search strategies that improve the efficiency of stochastic gradient descent systems.
Automated Review Generation Method Based on Large Language Models.	Utilizing LLMs, researchers have created an automated approach for generating reviews to assist in managing the massive amount of scientific material.
CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning.	CLEFT is a Contrastive Learning technique meant for medical imaging that aims to overcome the drawbacks of current, resource-intensive CLIP-like methods.
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.	To boost model performance, there is a lot of demand to leverage computation at inference time. This essay explores the trade-offs made between various approaches and presents several useful ones. This often suggests a larger trend of getting more performance out of smaller machines.
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion.	It is easy to utilize a DiT model to generate unique things based on textual inputs by treating 3D objects as UV-wrapped images.

News

Link	description
Character.AI CEO Noam Shazeer returns to Google.	In a big move, Character.AI co-founder and CEO Noam Shazeer is returning to Google after leaving the company in October 2021 to found the a16z-backed chatbot startup.
Three New Additions To Gemma 2.	Google is expanding the Gemma 2 family of models with the addition of a new 2B parameter model, safety content classifier model, and model interpretability tool.
Microsoft says OpenAI is now a competitor in AI and search.	Microsoft’s annually updated list of competitors now includes OpenAI, a long-term strategic partner. The change comes days after OpenAI announced a prototype of a search engine. Microsoft has reportedly invested $13 billion into OpenAI.
Introducing GitHub Models.	We are launching GitHub Models, enabling our more than 100 million developers to become AI engineers and build industry-leading AI models.
Reddit CEO says Microsoft needs to pay to search the site.	In an interview, Steve Huffman calls out Microsoft’s Bing, Anthropic, and Perplexity for scraping Reddit’s data without permission. ‘It has been a real pain in the ass to block these companies.’
Elon Musk sues OpenAI again, alleging ‘deceit of Shakespearean proportions’.	Tesla CEO alleges his former partners, including CEO Sam Altman, manipulated him into co-founding the company
Google broke the law to maintain online search monopoly, US judge rules.	White House calls decision – that could have major implications for web use – ‘victory for the American people’
Secretaries of state called on Musk to fix chatbot over election misinformation.	X’s Grok AI chatbot falsely told users ‘ballot deadline has passed for several states’
Groq Raises $640M To Meet Soaring Demand for Fast AI Inference.	To address the demand for massive language model inference, Groq, the startup that is developing AI chips with lightning speed, is raising a significant amount of funding.
Elon Musk sues OpenAI, Sam Altman for making a “fool” out of him.	Having promised to keep OpenAI's technology open-source and prioritize the public good, Elon Musk has revived a lawsuit against the company and its CEO, Sam Altman. He claims that by turning OpenAI into a for-profit venture with ties to Microsoft, they obtained $44 million in seed funding fraudulently, which Musk claims betrays the original mission and has caused irreparable harm to both his interests and the public.
OpenAI Co-Founders Schulman and Brockman Step Back.	John Schulman has joined Anthropic as an independent contributor, while Greg Brockman is enjoying a long holiday.
Llama 3.1 Impact Grants.	Meta has announced a program to award groups using its models for good with $2m to help develop these tools for economically and socially impactful projects.
BYU engineering research finds key to quicker nuclear power: artificial intelligence.	Professor of chemical engineering at BYU Matt Memmott has created an AI algorithm that has the potential to drastically lower costs by ten years in the design and licensing of nuclear reactors. According to his team's study, AI can solve difficult nuclear design challenges far more quickly than conventional techniques; in one case, the design process was shortened from six months to just two days. The conclusions seek to preserve low electricity costs while meeting growing energy demands by speeding up the development of nuclear power.
OpenAI tempers expectations with less bombastic, GPT-5-less DevDay this fall.	According to OpenAI, this year's DevDay conference will no longer be a large event but rather a series of smaller, mobile developer sessions that will concentrate on upgrades to developer services and APIs rather than the introduction of a new flagship model.
Tezi raises $9M to launch Max: the first fully autonomous AI recruiter.	To build Max, an AI-driven recruiting agent that conducts hiring procedures from beginning to end on its own, Tezi raised $9 million in seed funding, with the lead investors being 8VC and Audacious Ventures.
Apple Intelligence rollout timetable won't delay iPhone 16.	Apple Intelligence capabilities will be added to iOS 18 after launch; initial access will be available to iPhone 15 Pro models exclusively in iOS 18.1.
Figure redesigns its humanoid robot from the ground up for slick new F.02.	California-based robotics outfit Figure has today announced its second-generation humanoid robot, which is initially being aimed at production lines in commercial settings, but the company is promising a bipedal butler in our homes shortly.
Structured Outputs in OpenAI API.	It is difficult to request organized output, such as JSON, from language models. With the help of this new functionality in OpenAI's API, language model creation may produce structured output that deterministic applications downstream can use.
Meta is reportedly offering millions to use Hollywood voices in AI projects.	To obtain broad usage rights across all of its platforms, Meta is negotiating to use the voices of well-known actors like Awkwafina and Judi Dench for its AI digital assistant. If a settlement is reached, the actors may receive millions of dollars in compensation, with SAG-AFTRA protecting likenesses created by AI. The business recently canceled a celebrity voice chatbot project, and now plans to showcase these AI technologies at its Connect conference in September.
With Smugglers and Front Companies, China Is Skirting American A.I. Bans.	A thriving underground market persists despite U.S. sanctions meant to stop the transfer of AI chips to China, facilitating large transactions such as the $103 million purchase using Nvidia processors. In an attempt to get around prohibitions, new businesses are founded, delivery methods are deceitful, and international distribution gaps are exploited. The ongoing illicit commerce has sparked discussions about the efficacy of American export regulations and how they affect US tech companies in comparison to their Chinese rivals.
Nvidia Blackwell GPUs allegedly delayed due to design flaws — launch expected to be pushed back by three months or more.	Microsoft, Meta, Google, and xAI will have to wait a few more months to receive their massive GPU orders.
OpenAI says it’s taking a ‘deliberate approach’ to releasing tools that can detect writing from ChatGPT.	OpenAI has built a tool that could potentially catch students who cheat by asking ChatGPT to write their assignments — but according to The Wall Street Journal, the company is debating whether to release it.
Zuckerberg touts Meta’s latest video vision AI with Nvidia CEO Jensen Huang.	Meta had a palpable hit last year with Segment Anything, a machine learning model that could quickly and reliably identify and outline just about anything in an image. The sequel, which CEO Mark Zuckerberg debuted on stage Monday at SIGGRAPH, takes the model to the video domain, showing how fast the field is moving.
Gemini intelligence is coming to Google Home.	Google Assistant is getting a major upgrade on Nest smart speakers and displays, and Nest cameras will soon be able to tell as well as show, as Google Home gets a powerful AI infusion
Zuckerberg says Meta will need 10x more computing power to train Llama 4 than Llama 3.	Meta, which develops one of the biggest foundational open source large language models, Llama, believes it will need significantly more computing power to train models in the future.
AMD is becoming an AI chip company, just like Nvidia.	AMD’s AI GPU sales just went from a billion dollars cumulatively to a billion dollars quarterly.
Microsoft Is Losing a Staggering Amount of Money on AI.	With an emphasis on data centers for AI capabilities, Microsoft's spending in AI jumped to $19 billion in the most recent quarter; nevertheless, significant AI revenue is yet unknown.
Taco Bell’s drive-thru AI might take your next order .	Taco Bell’s parent company aims to bring its ‘Voice AI’ technology to hundreds of stores in the US by the end of 2024.
OpenAI invests in a webcam company turned AI startup.	penAI is leading a $60 million funding round for Opal, the same company behind the high-end Tadpole webcam, according to a report from The Information.
UK regulator to examine $4bn Amazon investment in AI startup Anthropic.	Move is the latest of a string of CMA investigations into technology tie-ups
Hugging Face acquires XetHub.	The majority of the data that Hugging Face serves and keeps is in LFS. XetHub has developed a strong substitute for Git repositories' scalability.
Humane’s daily returns are outpacing sales.	The company is scrambling to stabilize as it hits $1 million in total returns against $9 million in sales.
GPT-4o System Card.	Setting up a voice system can be difficult. The ongoing efforts to guarantee the safety and usefulness of the multimodal paradigm are highlighted in this piece.
Fully-automatic robot dentist performs world's first human procedure.	In a historic moment for the dental profession, an AI-controlled autonomous robot has performed an entire procedure on a human patient for the first time, about eight times faster than a human dentist could do it.
Microsoft launches GitHub Models, offering 100 million developers easy access to leading AI tools.	Microsoft has introduced "GitHub Models," a new platform that enables over 100 million developers to integrate AI into their software projects by providing access to a variety of AI models. This includes popular models like Llama 3.1, GPT-4o, and Mistral Large 2, among others. Developers can explore these models for free through a built-in model playground on GitHub, where they can experiment with different prompts and model parameters.
Google brings Gemini-powered search history and Lens to Chrome desktop.	Google Thursday said that it is introducing new Gemini-powered features for Chrome’s desktop version, including Lens for desktop, tab compare for shopping assistance, and natural language integration for search history.
Apple changes EU App Store rules after commission charges.	Change in policy means developers will be able to communicate with customers outside App Store

Resources

Link	description
Adaptive Retrieval-Augmented Generation for Conversational Systems.	In addition to demonstrating the potential for RAG-based conversational systems to produce high-quality responses and high-generation confidence, Adaptive RAG for Conversations Sytems also develops a gating model that predicts whether a conversational system needs RAG to improve its responses. It further asserts that a correlation can be found between the relevance of the augmented knowledge and the generation's degree of confidence.
ShieldGemma: Generative AI Content Moderation Based on Gemma.	Based on Gemma 2, ShieldGemma provides a full suite of LLM-based safety content moderation models, including classifiers for major damage categories like toxicity, hate speech, and hazardous content.
PersonaGym: Evaluating Persona Agents and LLMs.	Assessing Persona Agents: This study suggests a standard for assessing persona agent skills in LLMs; it discovers that, while being a somewhat more sophisticated model, Claude 3.5 Sonnet only shows a 2.97% relative improvement in PersonaScore when compared to GPT 3.5.
The Art of Refusal: A Survey of Abstention in Large Language Models.	a review of the approaches currently employed in LLMs to attain rejection; offers measures and benchmarks for evaluation that are used to gauge abstinence in LLMs.
XHand: Real-time Expressive Hand Avatar.	A brand-new hand avatar called XHand is intended for real-time rendering in virtual worlds and video games. In contrast to earlier models, XHand concentrates on producing intricate hand morphologies, looks, and deformities.
Prompt Poet.	Millions of talks are served by Character AI's prompted construction library, which is made available to the public.
NAVIX: minigrid in JAX.	A popular testing bed for RL has been accelerated in JAX.
SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models.	A novel data synthesis pipeline for Vision Large Language Models (VLLMs) is called SynthVLM. Rather than captioning photos directly, SynthVLM leverages sophisticated diffusion models to produce high-resolution images from captions.
Networks that compress themselves.	You can train a more accurate, self-quantized model that gets smaller by integrating the network's size in the loss function.
Video Tracking with Language Embeddings.	A novel technique that leverages language embeddings to enhance point tracking in lengthy video sequences has been developed by researchers.
Boosting Efficiency in Vision-Language Model Training.	This effort addresses the imbalance brought about by different data distributions and model architectures by introducing a technique to balance computational burdens during large-scale 3D simultaneous training of vision-language models.
TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling.	High-quality generation of textures on 3d models with diffusion.
MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization.	This work uses textual, 2D, or 3D input to create artistic meshes. To sample effectively, it takes advantage of neighboring tokens and enhancements to the vertex representation.
CogVideo.	A text-to-video model available for free that performs nearly as well as closed video creation technologies.
MiniCPM-V.	Amazing vision language model with near real-time performance. It performs better on certain benchmarks than closed models.
RecDiffusion: Rectangle for Image Stitching with Diffusion Models.	RecDiffusion is a framework that improves the aesthetic appeal of stitched photos without requiring any cropping or distortion.
LLaVA-OneVision: Easy Visual Task Transfer.	In visual language models, there has been an effort to make them versatile and easy to tune. This reminds me of computer vision from ten years ago. Crucially, LLaVA-OneVision demonstrates how meticulous data curation and architecture upgrades may do this.
ABC Invariance.	To migrate your hyperparameters from smaller to larger models, use muP. A fantastic theorem that says you may vary where you scale model outputs and it won't affect ultimate transfer performance is demonstrated in practice in this GitHub gist.
XLabs-AI/flux-controlnet-canny.	XLabs has released the first Flux-Dev control net which allows for generation conditioned on Canny image inputs.
HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection.	A framework called HARMONIC is used to create and assess synthetic tabular data by utilizing big language models.
Introducing Qwen2-Math.	A 72B math model developed by the Qwen team beats all other open and closed models on MATH. Additionally, it beats Llama-3.1-405B on some measures related to reasoning. Only English is available at this time; multilingual models will be available soon.
SAM2-PATH: A better segment anything model for semantic segmentation in digital pathology.	A novel approach called SAM2-PATH aims to improve semantic segmentation in digital pathology.
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond.	A new multilingual Spoken Language Understanding (SLU) dataset is called Speech-MASSIVE. It provides an analogous speech-based corpus to the massive text corpus.
PyTorch FlexAttention.	A new API from PyTorch makes it possible to design and compile any kind of attention variant to Triton. Better portability, performance, and research velocity on attention types are made possible by this.
A Language Model with Quick Pre-Training.	The "1.5-Pints" Language Model offers a novel method for pre-training that is compute-efficient. This model outperforms Apple's OpenELM and Microsoft's Phi in instruction-following tasks, as determined by MT-Bench, by curating a high-quality dataset of 57 billion tokens.
lighthouse.	Lighthouse is a user-friendly library for reproducible and accessible research on video moment retrieval (MR) and highlight detection (HD). It supports six VMR-HD models, three features, and five datasets for reproducible VMR-HD.

Perspectives

Link	description
AI existential risk probabilities are too unreliable to inform policy.	The use of AI existential risk probability estimates for policymaking is criticized in this essay, which contends that these estimates are excessively erratic and lack a strong inductive or deductive foundation, frequently approximating educated guesses rather than fact-based projections. The authors argue against the validity of using these projections to inform public policy, particularly when they are connected to expensive or restricting measures, and they support an evidence-based strategy that takes AI development uncertainty into account. They advise against utilizing speculative existential risk probability in high-impact decisions and instead suggest concentrating on specified AI milestones for more significant policy choices.
Is AI judging the future of gymnastics or just a surveillance tool?	To provide more equitable and transparent scoring, the International Gymnastics Federation (FIG) and Fujitsu have partnered to provide an AI-assisted judging support system at the World Gymnastics Championships. With room for future development and wider uses, the Judging Support System (JSS), which will not take the place of judges, provides 3D model-based second views in challenging cases and inquiry disagreements. The JSS may improve scoring accuracy and consistency, which is important in a sport where even small point variations have a significant impact on standings and players' careers, despite worries that it may replace human judgment.
Why AI’s Tom Cruise problem means it is ‘doomed to fail’.	LLMs’ ‘reversal curse’ leads it to fail at drawing relationships between simple facts. It’s a problem that could prove fatal
Sound clashes are a thrilling reggae tradition. Will AI ruin them?	The use of fake AI vocals – including those of Donald Trump – is sending shockwaves through this historic scene. At a Montego Bay clash, performers debate their culture’s future
Replacing my Right Hand with AI.	While riding a bike, an anthropic scientist broke their hand. They continued to be incredibly productive by leaning into Claude and his voice.
TPU transformation: A look back at 10 years of our AI-specialized chips.	Because it has invested in bespoke TPU chips, Google is one of the only companies training massive models without being dependent on Nvidia.
I'm Switching Into AI Safety.	Alex Irpan left Google's robotics team after eight years to join Google DeepMind's AI safety team. His move was motivated by a personal desire to address safety concerns as AI systems get closer to being superhuman. Though the area is difficult and fraught with controversy, they voice concerns about the effectiveness of present AI safety measures, the growing risks of unmanaged AI growth, and their dedication to contributing to AI safety.
As Regulators Close In, Nvidia Scrambles for a Response.	With a 90 percent share of the A.I. chip market, the company is facing antitrust investigations into the possibility that it could lock in customers or hurt competitors.
How GitHub harnesses AI to transform customer feedback into action.	GitHub is using AI and machine learning to compile and evaluate user input at scale, providing useful insights that drive feature prioritization and product enhancements. This automated method improves responsiveness to developer needs by facilitating the collection of multilingual input and promoting data-driven decision-making. The project demonstrates GitHub's dedication to utilizing AI to uphold a developer-centric approach to product development.
How Does OpenAI Survive?	The paper expresses a strong doubt regarding the sustainability of OpenAI, given the exorbitant costs associated with constructing and maintaining huge language models, as well as the absence of broad business utility for generative AI. The long-term sustainability of OpenAI is questioned by the author in the absence of substantial technology advancements or persistent, extraordinary fundraising efforts. Even though OpenAI has had a significant impact on the AI sector, the business still has issues with profitability, high operational burn rates, and a reliance on key alliances, most notably Microsoft.
How neurons make a memory.	Loosely packaged DNA might make these nerve cells better able to encode memories.
DeepMind hits milestone in solving maths problems — AI’s next grand challenge.	AlphaProof showed its prowess on questions from this year’s Mathematical Olympiad — a step in the race to create substantial proofs with artificial intelligence.
Dirty talk: how AI is being used in the bedroom – and beyond.	Analysis of more than 200,000 chatbot conversations shows how the new tech is actually being used. Turns out quite a lot of it is ‘racy role play’
Scientists are falling victim to deepfake AI video scams — here’s how to fight back.	Cybercriminals are increasingly singling out researchers, alongside politicians and celebrities. Targeted scientists share tips on how to silence them.
What lies beneath: the growing threat to the hidden network of cables that power the internet.	Last month large parts of Tonga were left without internet when an undersea cable was broken. It’s a scenario that is far more common than is understood
Why AI hasn’t shown up in the GDP statistics yet.	Even though LLMs have made remarkable strides in handling complicated tasks, they are still unable to reliably complete activities at a scale comparable to that of humans. As a result, their current potential as direct human substitutes in processes is limited. LLMs require comprehensive prompt engineering and iteration to reach acceptable accuracy. The latest JSON output control and cost reduction enhancements from OpenAI may help with certain problems, but the subtle integration needed for LLMs in corporate settings points to gradual productivity increases rather than a sudden economic revolution.
AI Is Coming for India's Famous Tech Hub.	AI integration is posing a danger to employment, particularly in routine operations like contact centers, which has caused a sea change in India's technology outsourcing sector. While recruiting is slowing down, companies are finding it difficult to move up the value chain. However, some are optimistic that AI technologies may open up new opportunities in fields like programming. Higher-order cognitive abilities will be necessary in the sector going forward as automation continues to reshape traditional employment.
Inside the company that gathers ‘human data’ for every major AI company.	Advances in AI pre-training have made it possible for models to handle large amounts of online data and supervised fine-tuning with specialists afterward aids in the models' ability to become more specialized and general. The goal of Turing's method is to improve AI reasoning capabilities by leveraging "input and output pairs" created by subject-matter experts. These models, foreseeing the "agentic" future of artificial intelligence, might integrate specialized knowledge across areas to accomplish complicated tasks independently.

Back to index

ML news: Week 29 July - 4 August

Research

Link	description
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach.	compares RAG to long-context LLMs and discovers that while RAG is much less expensive, long-context LLMs perform better on average; Offers Self-Route, which routes inquiries to RAG or LC by using self-reflection; it claims to have a substantial computational cost reduction with a performance that is comparable to LC.
Recursive Introspection: Teaching Language Model Agents How to Self-Improve.	asserts that LLMs can be iteratively fine-tuned to improve their own response over multiple turns with additional feedback from the environment; the LLM learns to recursively detect and correct its past mistakes in subsequent iterations; and enhances 7B models' self-improvement abilities on reasoning tasks (GSM8K and MATH), achieving an improvement over turns that is not observed in strong proprietary models.
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference.	presents a novel dynamic token pruning technique for effective long-context LLM inference; it can maintain high accuracy while speeding up the prefilling stage of a Llama 2 7B model by 2.34 times; it computes the KV for tokens that are crucial for the next token prediction in both the prefilling and decoding stages; it enables language models to dynamically select different subsets of tokens from the context in different generation steps, even though they may have been pruned in a previous step.
Generation Constraint Scaling Can Mitigate Hallucination.	suggests a novel training-free method to reduce hallucinations in LLMs; they scaled the readout vector that limits generation in a memory-augmented LLM decoder; current research suggests that LLMs with explicit memory mechanisms can help reduce hallucinations; this work employs a memory-augmented LLM and applies lightweight memory primitives to limit generation in the decoder.
Align and Distill: Unifying and Improving Domain Adaptive Object Detection.	The difficulties of getting object detection models to perform well on a variety of data formats that they weren't initially trained on are addressed by a new method named ALDI.
Small Molecule Optimization with Large Language Models.	By gathering a dataset of 100 million molecules (40 billion token equivalent), two new language models were able to enhance their performance by 8% on the Practical Molecular Optimization benchmark.
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation.	With a fairly comparable inference cost, code generation performance can be enhanced by repeatedly using smaller models.
Self-Directed Synthetic Dialogues and Revisions Technical Report.	More than 300,000 dialogues and criticisms will be incorporated into open models. The dataset, which was primarily produced with synthetics, is a potent illustration of synthetic data utilizing open models.
Theia: Distilling Diverse Vision Foundation Models for Robot Learning.	Theia, a vision foundation model for robot learning that combines several current vision models, is presented in this study. Rich visual representations provided by Theia improve robot learning even when using smaller model sizes and less training data. Test results indicate that Theia performs better than its predecessors, and the authors propose that enhanced performance is caused by more entropy in feature norms. The public is free to utilize the models and code.
Do We Really Need Graph Convolution During Training? Light Post-Training Graph-ODE for Efficient Recommendation.	A novel strategy to increase the effectiveness and scalability of recommender systems is called LightGODE. Adopting a continuous graph ODE and concentrating on post-training graph convolution, avoids the need for costly computations during training.

News

Link	description
Llama 3.1	a group of LLMs that includes models with 8B, 70B, and 405B parameters; it supports eight languages and expands the context window to 128K tokens; it exceeds state-of-the-art models in certain situations and competes favorably in other areas, including as general knowledge, math reasoning, and tool use.
Nvidia’s new Titan GPU will beat the RTX 5090, according to leak.	After skipping its ultra-expensive flagship graphics card with its Ada lineup, Nvidia could be bringing back the Titan with a Blackwell GPU.
Elon Musk will ‘discuss’ Tesla investing $5 billion in his private AI company.	Elon Musk says that he will ‘discuss’ Tesla investing $5 billion in xAI, his own private artificial intelligence company. For the last few years, Musk has claimed that “Tesla is an AI company.”
OpenAI training and inference costs could reach $7bn for 2024, AI startup set to lose $5bn - report.	In 2023, OpenAI projected that ChatGPT inference would cost about $4 billion on Microsoft's Azure servers, potentially resulting in large financial losses. Even though OpenAI is making about $2 billion a year from ChatGPT, it would need more money in less than a year to cover a $5 billion deficit. With subsidized prices from Azure, it presently uses the equivalent of 350,000 Nvidia A100 chip servers, primarily for ChatGPT.
Elon Musk sets new date for Tesla robotaxi reveal, calls everything beyond autonomy ‘noise’.	Elon Musk says he will show off Tesla’s purpose-built “robotaxi” prototype during an event October 10, after scrapping a previous plan to reveal it August 8. Musk said Tesla will also show off “a couple of other things,” but didn’t explain what that meant.
Stability AI steps into a new gen AI dimension with Stable Video 4D.	Stability AI is expanding its growing roster of generative AI models, quite literally adding a new dimension with the debut of Stable Video 4D.
Google’s Gemini AI is getting faster with its Flash upgrade.	Google’s Gemini AI chatbot will be able to respond to you more quickly and process more content in prompts thanks to an upgrade to the company’s Gemini 1.5 Flash AI model.
Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images.	Real-time promptable segmentation for videos and images from Meta.
Apple says its AI models were trained on Google’s custom chips.	Apple said in a technical paper on Monday that the two AI models underpinning Apple Intelligence, its AI system, were pre-trained on Google-designed chips in the cloud.
AI Startup Anthropic Faces Backlash for Excessive Web Scraping.	Freelancer.com CEO claims Anthropic's crawler violated the "do not crawl" protocol, causing site slowdowns.
Apple Intelligence Foundation Language Models.	Apple has outlined the basics of its language models for its newly announced “Apple Intelligence” initiative.
Microsoft beats revenue forecasts but poor performance of cloud services drags share price.	Firm’s earnings were up 15% year-on-year, but Azure’s lower returns resulted in share prices falling by as much as 7%
UK regulator looks at Google’s partnership with Anthropic.	CMA to consider whether the deal with AI startup is a potential merger, which could prompt full investigation
OpenAI has released a new ChatGPT bot that you can talk to.	The voice-enabled chatbot will be available to a small group of people today, and to all ChatGPT Plus users in the fall.
Meta's new AI Studio helps you create your own custom AI chatbots.	Headed for the web as well as Instagram, Messenger, and WhatsApp, AI Studio will let you build a chatbot that acts as a virtual extension of yourself.
Perplexity Will Soon Start Selling Ads Within AI Search.	Facing backlash for scraping publisher data, the young company says it’ll now compensate publishers whose content is used in answers to search questions.
The AI job interviewer will see you now.	AI interview services say they’re eliminating bias — but not everyone agrees. Companies are adopting AI job interview systems to handle incoming applicants. LLMs allow the interviewer to incorporate follow-up questions based on the subject’s response. Critics say the opaque models raise serious concerns about bias, particularly where there is no documentation about how a decision is made.
Canva buys Leonardo.	Leonardo, a generative picture firm, joins Canva to enhance the creative tools of both organizations.
Announcing Phi-3 fine-tuning, new generative AI models, and other Azure AI updates .	Updates to Azure AI have been released by Microsoft. These include PHI-3 model serverless fine-tuning, enhanced PHI-3-MINI performance, and the incorporation of models such as Meta's LLAMA 3.1 and GPT-4o mini into Azure AI.
Strong earnings report pushes Meta shares up amid heavy AI spending.	Stock price grew around 5%, which revealed the company outperformed analysts’ expectations for its second quarter
Argentina will use AI to ‘predict future crimes’ but experts worry for citizens’ rights.	President Javier Milei creates security unit as some say certain groups may be overly scrutinized by the technology
White House says no need to restrict ‘open-source’ artificial intelligence — at least for now.	The White House is coming out in favor of “open-source” artificial intelligence technology, arguing in a report Tuesday that there’s no need right now for restrictions on companies making key components of their powerful AI systems widely available.
Samsung hints at new products as it bets on AI to drive upgrades to its latest foldable phones.	Speaking to CNBC, Samsung Electronics’ mobile boss TM Roh discussed Galaxy AI and software strategy, while hinting at future foldable products and mixed reality headsets. Roh said the company hopes its suite of AI software will push users to upgrade to its latest smartphones.
Elon Musk calls Grok 'the most powerful AI by every metric' but 'secretly' trains the new model with your X data by default.	X's new experience is automatically set to opt-in and uses your data to train its Grok AI model.
NVIDIA Accelerates Humanoid Robotics Development.	To accelerate the development of humanoid robotics, NVIDIA has introduced new services and platforms, such as teleoperated data capturing workflows, OSMO orchestration, and NIM microservices.
US’ first robot-assisted dual kidney transplant performed in Ohio.	Joanne’s surgery was unique because doctors used the robotic surgical technique to implant two kidneys from a single deceased donor.
Intel announces plan to cut 15,000 jobs to ‘resize and refocus’ business.	Firm reported a loss in its second quarter and said it would cut 15% of its workforce to cut costs and compete with rivals
UK shelves £1.3bn of funding for technology and AI projects.	Britain’s first next-generation supercomputer, planned by Tories, in doubt after Labour government move
Black Forest Labs.	The founders of Latent Diffusion, Stable Diffusion, VQGAN, and other startups have raised over $30 million to launch their new business. They have introduced new flagship picture generation devices that are available in multiple levels and are incredibly competent.
OpenAI pledges to give U.S. AI Safety Institute early access to its next model.	OpenAI CEO Sam Altman says that OpenAI is working with the U.S. AI Safety Institute, a federal government body that aims to assess and address risks in AI platforms, on an agreement to provide early access to its next major generative AI model for safety testing.
The EU’s AI Act is now in force.	This starts the clock on a series of staggered compliance deadlines that the law will apply to different types of AI developers and applications. Most provisions will be fully applicable by mid-2026. But the first deadline, which enforces bans on a small number of prohibited uses of AI in specific contexts, such as law enforcement use of remote biometrics in public places, will apply in just six months.
Introducing Stable Fast 3D: Rapid 3D Asset Generation From Single Images.	A fantastic new quick and strong 3D generation model has been launched by Stability AI. Like the company's earlier versions, it operates under the same commercial license.
Introducing torchchat: Accelerating Local LLM Inference on Laptop, Desktop and Mobile.	A fantastic sample library for local language model chats has been made available by the PyTorch team. It can run the most recent Llama 3.1 models and comes with a reliable sample system.
Heeyo built an AI chatbot to be a billion kids’ interactive tutor and friend.	Xiaoyin Qu founded the firm Heeyo, which has released an AI-powered software with interactive games and a chatbot for kids three to eleven years old. With features like data protection and material created by child development specialists, the app strives to prioritize safety while offering tailored learning experiences. Though there may be worries about AI for children, Heeyo has raised $3.5 million in seed money. It presents itself as a secure and instructive substitute for well-known video and gaming platforms.
Cerebras IPO.	Cerebras Systems announced a proposal for IPO to the SEC.
LLMs breach a threshold.	FLOPs as a regulatory threshold have been the subject of dispute since Meta's open-source LLM Llama 3.1, trained on 3.8x10^25 FLOPs and equipped with 405B parameters, was recently released.

Resources

Link	description
OpenDevin: An Open Platform for AI Software Developers as Generalist Agents.	provides a framework for creating generalist agents that use software to interact with the outside world. Its features include 1) an interface for creating and executing code, 2) an environment with a sandboxed operating system and web browser accessible to the agents, 3) an interface for interacting with interfaces and environments, 4) support for multiple agents, and 5) an evaluation framework.
A Survey on Employing Large Language Models for Text-to-SQL Tasks.	gives an overview of using LLMs for Text-to-SQL operations, covering benchmarks, prompt engineering strategies, and fine-tuning procedures.
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens.	Open-source a massive multimodal interleaved dataset with 3.4 billion images and 1 trillion tokens; additional sources like PDFs and ArXiv papers are also included.
StreamMOS: Streaming Moving Object Segmentation with Multi-View Perception and Dual-Span Memory.	StreamMOS is a new approach for segmenting moving objects using LiDAR in autonomous driving and robotics.
Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography.	Scientists have devised a technique that incorporates miniature spectrometers to enhance mobile photography. To improve image quality, this innovative method combines RGB and low-resolution multi-spectral images.
BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation.	A fresh and enhanced monocular depth model for numerous real-world situations.
3D Object Segmentation with Language.	RefMask3D is a technique that uses natural language descriptions to partition items in 3D point clouds. With Geometry-Enhanced Group-Word Attention and Linguistic Primitives Construction, the system improves vision-language feature fusion and tackles sparse and irregular point cloud problems.
Efficient Cell Segmentation.	A novel technique for high-accuracy cell segmentation, LKCell strikes a compromise between computational efficiency and broad receptive fields.
Tactics for multi-step AI app experimentation.	Typically, LLM programs have several components; this article examines various strategies along with pertinent code snippets.
AccDiffusion.	a technique that significantly enhances diffusion models' ability to synthesize high-quality images.
HybridDepth.	A depth estimate pipeline called HYBRIDDEPTH was created to address issues with scale ambiguity and technology variation in mobile augmented reality.
VSSD: Vision Mamba with Non-Causal State Space Duality.	A novel method for mitigating the high computing needs of vision transformers is the Visual State Space Duality (VSSD) paradigm.
A New Benchmark for Autonomous Agents.	AppWorld Engine is a sophisticated execution environment that features nine daily apps and 457 APIs
Crash Course in Deep Learning.	The creation and application of multi-layer perceptrons (MLPs), a kind of fully connected neural network used in deep learning, are covered in this article.
SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain.	In this study, two huge language models with 54 billion and 141 billion parameters, respectively, that are intended for the legal industry, are introduced: SaulLM-54B and SaulLM-141B. The researchers used the Mixtral architecture to provide large-scale domain adaptation by aligning outputs with human legal interpretations, continuing pre-training using an extensive legal corpus, and adhering to a particular legal instruction-following procedure. The models provide state-of-the-art performance on LegalBench-Instruct and outperform earlier open-source models. These models' base, instruct, and aligned versions are available for reuse and group study under the MIT License.
WFEN.	To boost face super-resolution, researchers have created a feature augmentation network based on wavelets. The technique uses a full domain Transformer and breaks down input data into high and low-frequency components to improve facial details without generating distortions.
ChartQA-MLLM.	This experiment suggests a novel approach to multimodal large language models-based chart question answering.
DGFNet.	A novel method for forecasting the paths of several traffic participants in autonomous driving is called DGFNet. By taking into account the variations in difficulty between agents, recording detailed spatiotemporal data, and utilizing a difficulty-guided decoder, it improves predictions.
SAE for Gemma.	This demo is a beginner-friendly introduction to interpretability that explores an AI model called Gemma 2 2B. It also contains interesting and relevant content even for those already familiar with the topic.
Machine Unlearning in Generative AI: A Survey.	This in-depth analysis of generative AI examines machine unlearning. It addresses how to formulate problems, how to evaluate them, and the advantages and disadvantages of different approaches.
Elysium: Exploring Object-level Perception in Videos via MLLM.	A step toward providing object tracking and related tasks in films for Multi-modal Large Language Models (MLLMs) is represented by Elysium.
Piano Performance Generation.	The two-stage Transformer-based model for creating emotionally charged piano performances is presented in this paper.
3D Generative Model for Dynamic Scenes.	A 3D generative model called DynaVol-S is very good at extracting object-centric representations from unsupervised films.
Add-SD: Rational Generation without Manual Reference.	Add-SD is a program that uses short text prompts to put things into realistic environments. Unlike other methods, this one doesn't require bounding boxes or other explicit references.
Flow Matching: Matching flows instead of scores.	Diffusion models possess great strength. It can be difficult to understand them. Theoretically, flow matching is one way to view them. This blog delves further into the diffusion math of flow matching.
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions.	MMTrail is a large-scale multi-modality video-language dataset with over 20M trailer clips, featuring high-quality multimodal captions that integrate context, visual frames, and background music, aiming to enhance cross-modality studies and fine-grained multimodal-language model training.
ARCLE - ARC Learning Environment.	ARCLE is an environment to aid reinforcement learning studies using the Abstraction and Reasoning Corpus (ARC).
Mishax.	DeepMind has released a library for studying language models via MI. The library helps with running models and functions from complex codebases without tons of importing headaches.
Engine Core.	Engine Core demonstrates a pattern for enabling LLMs to undertake tasks of a given scope with a dynamic system prompt and a collection of tool functions.
alphaXiv.	Open research discussion directly on top of arXiv

Perspectives

Link	description
My new iPhone symbolizes stagnation, not innovation – and a similar fate awaits AI.	Development of ChatGPT and its ilk will plateau, just like it did for smartphones, and then what are we left with? More ho-hum consumer tech
AI: Are we in another dot-com bubble?	A thorough examination by Translink Capital's Kelvin Mu contrasting the present AI cycle with the internet/telecom cycle of the 1990s. After comparing the two eras' technological, economic, and capital disparities, he comes to the conclusion that, even though a bubble may eventually occur, we are still a long way from there.
Robots sacked, screenings shut down: a new movement of Luddites is rising up against AI.	Company after company is swallowing the hype, only to be forced into embarrassing walk backs by anti-AI backlash
Chalkboards and What They Can Teach Us About Generative AI.	This article discusses the use of generative AI as a teaching tool and makes the case that the technology's compatibility with educational ideals should be taken into account in addition to its technical analysis. Although the author is receptive to the use of AI, she is wary of its potential effects and stresses the necessity for clear justifications for the use of particular resources in the classroom. The conversation compares and contrasts AI with conventional tools such as whiteboards, taking into account the educational and cultural consequences of each.
The Evolution of SaaS Pricing in the AI Era.	Because AI can automate work, the traditional seat-based pricing model in SaaS is becoming outdated. Work-based or outcome-based pricing models, which set prices according to the quantity of work AI completes or the results it achieves, are becoming more and more popular among businesses. While established players continue to use seat-based pricing, startups are utilizing innovative approaches to gain a competitive edge and more properly represent the value of AI.
TechScape: Will OpenAI’s $5bn gamble on chatbots pay off? Only if you use them.	The ChatGPT maker is betting big, while Google hopes its AI tools won’t replace workers, but help them to work better
New online therapies could help at least twice the number of people recover from anxiety.	Four internet treatments developed by University of Oxford will be rolled out across NHS trusts
AI Is a Services Revolution.	The effect of LLMs on the service economy is covered in this article, with special attention to knowledge-based industries including education, healthcare, and law. Enterprise adoption of AI is gradual, with many still in the trial phase, despite the rapid breakthroughs suggesting tremendous automation possibilities. The actual rollout is anticipated to occur gradually. In the changing market, specialized AI businesses that use LLMs to enhance industry-specific workflows will have an advantage.
Why Big Tech Wants to Make AI Cost Nothing.	Almost all firms are free to use Meta's open-sourced Llama 3.1, an LLM that competes with OpenAI's ChatGPT. This tactic might turn LLMs into commodities and increase demand for complementary products like server space. AI companies may encounter difficulties when large tech develop models that are comparable to theirs. Industry titans may surpass smaller rivals in terms of AI breakthroughs.
Who will control the future of AI?	To maintain AI supremacy over authoritarian regimes, OpenAI's Sam Altman has presented a strategic imperative for the US and its allies to lead a global AI initiative based on democratic values. This initiative calls for strong security, infrastructure investment, commercial diplomacy, and cooperative norms development.
Advanced AI assistants that act on our behalf may not be ethically or legally feasible.	Google and OpenAI have recently announced major product launches involving artificial intelligence (AI) agents based on large language models (LLMs) and other generative models. Notably, these are envisioned to function as personalized ‘advanced assistants’. With other companies following suit, such AI agents seem poised to be the next big thing in consumer technology, with the potential to disrupt work and social environments.
Three ways AI is changing the 2024 Olympics for athletes and fans.	From training to broadcasting, artificial intelligence will have an imprint on this year’s event for the first time.
Mixed signals on tech stocks amid debate over the viability of AI boom.	Fears of fresh sell-off after Nvidia and Microsoft shares dip, but other chip stocks continue to rise
Cheap light sources could make AI more energy efficient.	Light-based devices can reduce the energy consumption of computers, but most rely on lasers, which are expensive to integrate with other technologies. An approach that uses LEDs instead of lasers provides a path forward.
Raising children on the eve of AI.	As transformative AI becomes more likely, this author wonders how to get kids ready for a future that might look very different from what it is today, while also struggling with the timing and unpredictability of changes. In addition, they discuss the moral implications of bearing children in the face of AI-induced uncertainty. They also offer practical advice on how to raise "AI-native" children and parenting techniques that put happiness and adaptability before conventional career-focused routes. The author promotes having an open discussion about possible hazards with children, planning for a variety of futures, and leading a balanced life.
Your new AI Friend is almost ready to meet you.	Rather than focusing on increasing productivity, Avi Schiffmann is creating "Friend," an AI companion housed in a wearable necklace that is meant to provide connection and support. The gadget, which connects through an app, will initially be sold in 30,000 pieces for $99 per. January shipping is scheduled without a subscription cost. Schiffmann sees Friend developing into a digital relationship platform, separating the product from AIs that are task-oriented and concentrating instead on the new trend of meaningfully connecting with digital entities.
These AI firms publish the world’s most highly cited work.	US and Chinese firms dominate the list of companies that are producing the most research and patents in artificial intelligence.
How TikTok bots and AI have powered a resurgence in UK far-right violence.	Experts warn growth of extremist influencers and ‘micro-donations’ could create an even bigger wave of unrest
On speaking to AI.	The new AI-powered Siri and ChatGPT's new Advanced Voice mode have different ideologies. Agent systems, such as ChatGPT Voice, use strong, multimodal models for more natural and dynamic interactions, while Copilot systems use minimal models to focus on safety and privacy. This demonstrates the conflict between less capable, lower-risk systems and ones that give greater control and possible advantages.
How This Brain Implant Is Using ChatGPT.	Synchron has incorporated OpenAI's ChatGPT into their brain-computer interface (BCI) technology to provide quicker communication for individuals who are paralyzed. This BCI, known as a stentrode, is capable of deciphering mental orders. It currently provides response possibilities created by AI; in the future, it may also support multimodal inputs. With an eye toward FDA approval, Synchron plans to adapt its AI integrations to meet the demands of patients.
At the Olympics, AI is watching you.	Paris increased security in anticipation of the 2024 Olympics by using artificial intelligence (AI) to scan CCTV footage from metro and train stations for possible threats.
Why have the big seven tech companies been hit by AI boom doubts?	Their shares have fallen 11.8% from last month’s peak but more AI breakthroughs may reassure investors
We must be wary of the power of AI.	Robert Skidelsky is concerned about the surveillance potential or AI, while Brian Reffin Smith is worried about its capacity to hijack culture, and Michael Heaton warns that it relieves us of the need to think
OpenAI’s Sam Altman is becoming one of the most powerful people on Earth. We should be very afraid.	Sam Altman’s ChatGPT promises to transform the global economy. But it also poses an enormous threat. Here, a scientist who appeared with Altman before the US Senate on AI safety flags up the danger in AI – and in Altman himself

Back to index

ML news: Week 21 - 28 July

Research

Link	description
Prover-Verifier Games improve legibility of LLM outputs.	Iteratively trains helpful provers to produce correct solutions accepted by the verifier, sneaky provers to produce incorrect solutions that trick the verifier, and small verifiers to predict the correctness of solutions; this process helps train models that can produce text that is clear and accurate for both AI and human readers, which results in more reliable systems.
SpreadsheetLLM: Encoding Spreadsheets for Large Language Models.	outlines a method for efficiently encoding spreadsheets to maximize an LLM's comprehension and reasoning skills; creates a sheet compressor that efficiently compresses and encodes spreadsheets using inverse index translation, structural anchor-based compression, and data-format-aware aggregation modules; in GPT-4's in-context learning, it improves performance in spreadsheet table detection by 25.6%.
Context Embeddings for Efficient Answer Generation in RAG.	presents a useful context compression technique that shortens long contexts and accelerates generation times in RAG systems. Long contexts are condensed into a limited number of context embeddings, allowing for varying compression rates that balance generation quality against decoding time. This technique maintains high performance while reducing inference times by up to 5.69 x and GFLOPs by up to 22x.
Weak-to-Strong Reasoning.	reports that strong models can automatically refine their training data without explicitly being trained to do so; shows how to use weak supervision to elicit strong reasoning capabilities in LLMs without relying on human annotations or advanced models; permits extending a model's learning scope and scaling performance on reasoning.
Does Refusal Training in LLMs Generalize to the Past Tense?	concludes that many state-of-the-art LLMs can be jailbroken by simply rephrasing an LLM request into the past tense. For instance, "How to make a Molotov cocktail?" can be rephrased as "How did people make a Molotov cocktail?" The success rate of such requests can increase from 1% to 88% when using direct requests on GPT-4o.
NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?	presents the Ancestral Trace Challenge, which raises the bar for complex logical reasoning and is typical of real-world long-context tasks. Their findings imply that current LLMs struggle to handle reasoning tasks with complex logical relationships, even with texts shorter than 2K tokens. They also propose a framework (NeedleBench) of progressively challenging tasks to assess the long-context retrieval and reasoning capabilities of LLMs.
Distilling System 2 into System 1.	explores self-supervised ways for extracting high-quality outputs from System 2 methods and then refines System 1 to fit the System 2 method's predictions without creating intermediate steps; extracting reasoning from System 1 reduces the cost of inference.
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies.	This new study, which examines scaling laws for vocabulary size, suggests that larger models require larger vocabularies.
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models.	To address task interference in generalist Multimodal Large Language Models (MLLMs), researchers suggest the Mixture of Multimodal Experts (MoME).
Bucketed Ranking-based Losses for Efficient Training of Object Detectors.	Based on a bucketed ranking In object detection, losses increase the effectiveness of ranking-based loss functions.
SurvReLU: Inherently Interpretable Survival Analysis via Deep ReLU Networks.	Repaired linear unit (ReLU) networks are used in SurvReLU, a deep survival model that bridges the gap between "white-box" tree-based models and "black-box" neural networks.
Star Operation to Train Neural Networks.	By projecting data onto intricate, high-dimensional regions without the need for large architectures, the star operation improves AI models.
AI models fed AI-generated data quickly spew nonsense.	Researchers gave successive versions of a large language model information produced by previous generations of AI — and observed rapid collapse.
KAN or MLP: A Fairer Comparison.	Only in symbolic formula representation does KAN perform better than MLP when the same number of parameters, or FLOPs, are used. On other tasks related to machine learning, computer vision, natural language processing, and audio processing, MLP still performs better than KAN.
Ranking protein-protein models with large language models and graph neural networks.	A graph-based deep learning technique called DeepRank-GNN-esm is intended to rank and identify precise models of protein-protein interactions. In order to facilitate the selection of nearly natural PPI conformations, the program makes use of protein language models, which helps with illness research and treatment discovery.
Monitoring Environmental Changes.	Satellite imaging monitoring of Earth's surface changes was greatly improved using an AI-powered Change-Agent.
AlphaProof: AI achieves silver-medal standard solving International Mathematical Olympiad problems.	A pre-trained Gemini-style language model and an AlphaGo-style reinforcement learning algorithm were combined by DeepMind to create a model that can tackle International Mathematics Olympiad (IMO) questions at the silver medal level. In this year's challenge, the system was able to tackle 4/6 issues.
The Unit-Scaled Maximal Update Parametrization.	A technique to guarantee that a model's hyperparameters are unaffected by the model's size is to use muP. Additionally, our technique guarantees cross-model transferability among quantized models.

News

Link	description
GPs use AI to boost cancer detection rates in England by 8%.	‘C the Signs’ artificial intelligence program scans medical records to increase the likelihood of spotting cancers
Artificial Agency raises $16M to use AI to make NPCs feel more realistic in video games.	A group of former Google DeepMind researchers has created an AI behavior engine that aims to transform traditional video games into a more dynamic experience by improving how non-playable characters (NPCs) behave and interact with gamers.
Inside the United Nations’ AI policy grab.	The United Nations wants to create an artificial intelligence forum to rule them all.
Exclusive: Nvidia preparing version of new flagship AI chip for Chinese market.	Nvidia is using its collaboration with distributor Inspur to create a new AI chip called the B20 that is suited to the Chinese market and compliant with US export regulations. Sales of its cutting-edge H20 chip are expected to soar in China, where it is expected to sell over a million devices for a total estimated value of $12 billion this year. The United States is still applying pressure on semiconductor exports, and additional limitations and controls on the creation of AI models may be implemented.
Academic authors 'shocked' after Taylor & Francis sells access to their research to Microsoft AI.	Authors have expressed their shock after the news that academic publisher Taylor & Francis, which owns Routledge, had sold access to its authors’ research as part of an Artificial Intelligence (AI) partnership with Microsoft—a deal worth almost £8m ($10m) in its first year.
Cybersecurity firm Wiz rejects $23bn bid from Google parent Alphabet.	Israeli company aims for stock market flotation after spurning biggest deal in tech group’s history
Elon Musk claims Tesla will start using humanoid robots next year.	Billionaire says Optimus will start performing tasks for the carmaker in 2025 and could be ready for sale in 2026
AI ‘deepfake’ faces detected using astronomy methods.	Analysing reflections of light in the eyes can help to determine an image’s authenticity.
Cohere sees valuation soar to $5.5B after new funding round.	After closing a $500 million Series D fundraising round, Cohere, a Canadian AI business that specializes in massive language models, has been valued at $5.5 billion. Enhancing its enterprise-grade AI technology for increased worldwide business efficiency is the goal of the new funding. PSP Investments, Cisco, Fujitsu, AMD Ventures, and EDC are a few of the important investors.
Figma AI Update.	After discovering that its restricted beta 'Make Designs' AI tool produced UI designs that were too similar to pre-existing apps, Figma temporarily withdrew the capability. To guarantee uniqueness, the feature—which makes use of commercially available AI models like GPT-4 and Titan from Amazon—needs to be improved. In order to further support designers in utilizing AI for effective design creation, Figma hopes to re-enable the feature with enhanced quality assurance procedures.
ElevenLabs Turbo 2.5 model.	With the release of their latest model, Turbo 2.5, ElevenLabs has enabled high-quality low-latency conversational AI for approximately 80% of the world's languages, including Mandarin, Hindi, French, Spanish, and 27 more languages. It offers text-to-speech capabilities for Vietnamese, Hungarian, and Norwegian for the first time. English now operates 25% quicker than Turbo v2.
Google parent company’s second-quarter earnings outpace expectations.	Alphabet reports $84.7bn in revenue, on back of Search and Cloud, up from the same period last year
Meta launches open-source AI app ‘competitive’ with closed rivals.	Tech firm says its freely available and usable Llama 3.1 405B model is comparable with likes of OpenAI and Anthropic
Google AI predicts long-term climate trends and weather — in minutes.	Models that are more reliable and less energy-intensive could help us to better prepare for extreme weather.
Introducing Llama 3.1: Our most capable models to date.	Meta has made available training details for its first open-ended AI model. With a 128k context length, conversation models, and an excellent open system, the model is comparable to the best-closed models.
Harvey Raises Series C.	The unicorn-status legal business has acquired money from investors including Google Ventures to keep advancing into large law firms.
Gumloop seed round.	Gumloop raised $3.1 million in a seed round headed by First Round Capital, with involvement from YC and Instacart, Dropbox, and Airtable co-founders. With Gumloop, any person in a company can create their own AI tools and make just as much of an effect as an engineer thanks to a no-code AI automation platform.
AI Development Kits: Tenstorrent Update.	The Wormhole n150 and n300 PCIe cards, which retail for $999 and $1,399, are among the affordable AI development hardware that Tenstorrent has introduced. Developer workstations, such as the air-cooled TT-LoudBox ($12,000) and the water-cooled TT-QuietBox ($15,000), are also available. These products are intended to support AI development with an emphasis on connectivity and scaled-out performance.
AI predicts droughts a year in advance.	Researchers at Skoltech and Sber have created artificial intelligence (AI) models that can forecast droughts up to a year in advance, enhancing risk management for the banking, insurance, and agricultural industries. The models use publicly available data and spatiotemporal neural networks that have been validated in a variety of climates. The biggest bank in Russia intends to incorporate these discoveries into its risk evaluation frameworks.
Samsung is pouring research into ‘AI phones’ with ‘radically different’ hardware.	As with everywhere else, AI is taking a big role in the smartphone market. And Samsung has plans to make dedicated “AI phones” that are “radically different” from the Galaxy phones we see today.
CrowdStrike global outage to cost US Fortune 500 companies $5.4bn.	Banking and healthcare firms, major airlines expected to suffer most losses, according to insurer Parametrix
Mistral Large 2.	In line with the most recent Llama 3 405b model, Mistral has produced a 123B parameter model. A permissive research license governs its release.
OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole.	ts latest model, GPT-4o Mini, applies a new safety method to prevent tricking chatbots.
Introducing Stable Video 4D	A single object movie can be converted into eight distinct novel-view videos using Stable movie 4D. In roughly 40 seconds, Stable Video 4D produces 5 frames over 8 viewpoints with a single inference. By customizing the output to match certain creative objectives, users can set camera angles.
OpenAI tests new search engine called SearchGPT amid AI arms race.	SearchGPT Prototype., initially launching with select publishers and users, set to challenge Google’s dominance of online search
Microsoft is adding AI-powered summaries to Bing search results.	The race to bring more AI features to search is escalating, with Microsoft moving forward with additional tools for Bing. Today, the company began previews for Bing generative search, where the top result for a user's query will be an original response compiled by AI.
AI could enhance almost two-thirds of British jobs, claims Google.	Research commissioned by Google estimates 31% of jobs would be insulated from AI and 61% radically transformed by it
DeepMind hits milestone in solving maths problems — AI’s next grand challenge.	AlphaProof showed its prowess on questions from this year’s Mathematical Olympiad — a step in the race to create substantial proofs with artificial intelligence.
Elon Musk's Neuralink employees want to cash out .	Some of the staff at Elon Musk’s Neuralink are making preparations to sell the brain implant company’s stock in the wake of its valuation jumping following its first human trial, according to people familiar with the matter.
The AI boyfriend business is booming.	More and more women are turning to chatbots for companionship and connection because they see their empathetic representation to be more reliable than that of many human partners. By defying the image of undersocialized men conversing with AI partners in their parent's basement, these female AI users are questioning preconceived notions about what it means to be in a relationship.
OpenAI announces free fine-tuning for GPT-4o mini model.	Free fine-tuning allows OpenAI customers to train the GPT-4o mini model on additional data at no charge until September 23, starting with Tier 4 and Tier 5 users.
Elon Musk’s X under pressure from regulators over data harvesting for Grok AI.	Social media platform uses pre-ticked boxes of consent, a practice that violates UK and EU GDPR rules
A huge opportunity’: Quantum leap for UK as tech industry receives £100m boost.	Science secretary backs five quantum technology hubs in push for UK to transform healthcare and industry

Resources

Link	description
A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks.	a set of quick engineering techniques for various NLP applications.
Exploring Advanced Large Language Models with LLMsuite.	provides helpful advice for using and assessing LLMs in development; approaches discussed include parameter-efficient techniques, RAG, and ReAct.
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures.	offers a graphical taxonomy and detailed tour to the most recent developments in non-Euclidean machine learning.
DCLM-Baseline-7B.	DCLM-Baseline-7B is a 7 billion parameter language model trained on the DCLM-Baseline dataset, which was curated as part of the DataComp for Language Models (DCLM) benchmark. This model is designed to showcase the effectiveness of systematic data curation techniques for improving language model performance.
Endia.	Endia is a Mojo programming library that uses arrays to help with a variety of machine learning and scientific applications.
Txtai.	Txtai is a single-source embedding database for language model workflows, semantic search, and LLM orchestration.
OpenOCR.	OpenOCR aims to establish a unified training and evaluation benchmark for scene text detection and recognition algorithms
Converting Codebases With LLMs.	Mantle reduced the burden by handling boilerplate code and repeating patterns by transforming a prototype project into a production-ready codebase using a Gemini 1.0 Pro LLM with a one million token window. This method, which made use of a wealth of context and iterative code generation, allowed the team to concentrate on perfecting the most important twenty percent of the project, sparing months of developer effort.
CerberusDet: Unified Multi-Task Object Detection.	Using a YOLO architecture, the new CerberusDet framework combines several task heads into a single model to provide a versatile object detection solution.
mandark.	With the help of Claude Sonnet 3.5, this incredibly basic CLI may make code modification recommendations to enhance an existing code base.
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?	AssistantBench evaluates the ability of web agents to automatically solve realistic and time-consuming tasks. The benchmark includes 214 tasks covering multiple domains from more than 525 pages from 258 different websites.
orch.	Orch is a Rust programming language library for creating agents and apps driven by language models.
PlacidDreamer.	PlacidDreamer is a text-to-3D generation system that unifies generation directions and addresses over-saturation, resolving difficulties with prior approaches.
6DoF Head Pose Estimation through Explicit Bidirectional Interaction with Face Geometry.	To enhance head posture estimation, researchers created the head Translation, Rotation, and face Geometry network (TRG), concentrating primarily on head translations.
STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay.	Using just unlabeled test data, the STAble Memory rePlay (STAMP) technique resolves distribution shifts between training and test data. In contrast to other approaches, STAMP is quite good at eliminating outliers during inference as well as identifying recognized classes.
Local All-Pair Correspondence for Point Tracking.	An enhanced methodology for tracking any point in a video sequence is called LocoTrack. For accurate tracking, it makes use of bidirectional correspondence and local 4D correlation. Compared to current top models, LocoTrack functions at a speed that is almost six times faster.
Llama agent stack.	Meta has published an example system that may be used to carry out a range of activities by utilizing its Llama models as agents.
Artist: Aesthetically Controllable Text-Driven Stylization without Training.	For text-driven stylization, Artist is a training-free technique that manages the creation of content and style in pretrained diffusion models.
Odyssey.	A new framework called Odyssey gives huge language model-based agents sophisticated abilities to explore Minecraft.
AI is confusing — here’s your cheat sheet.	If you can’t tell the difference between AGI and RAG, don’t worry! We’re here for you.
Safety RBR Gold Dataset and Weight Fitting Code.	A set of code for OpenAI's rules-based rewards for the language model safety project is now available. Some of the data they utilized for training is included.
INF-LLaVA.	A Multimodal Large Language Model (MLLM) called INF-LLaVA was created to get over the difficulties associated with analyzing high-resolution photos.
Benchmarking Multi-Agent Reinforcement Learning.	A collection of uniform settings called MOMAland is intended to serve as a benchmark for multi-objective multi-agent reinforcement learning (MOMARL).
How to Create High-Quality Synthetic Data for Fine-Tuning LLMs.	Gretel just published fresh data that contrasts artificial intelligence (AI)-curated datasets with human expert data.
LoFormer: Local Frequency Transformer for Image Deblurring.	LoFormer ensures improved global modeling without compromising fine-grained details by efficiently capturing both low- and high-frequency features.
Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal.	A new large-scale dataset called Raindrop Clarity was created to overcome the shortcomings of the current raindrop removal datasets. It includes 15,186 image pairs/triplets in both day and night circumstances, with both background- and raindrop-focused shots.
dlordinal.	dlordinal is a Python library that unifies many recent deep ordinal classification methodologies available in the literature. Developed using PyTorch as an underlying framework, it implements the top-performing state-of-the-art deep learning techniques for ordinal classification problems.
Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning.	One method for long-term multi-agent human pose forecasting is the Trajectory2Pose model. It enhances the prediction of human mobility across extended periods and among several actors by utilizing a novel graph-based interaction module.
3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities.	This survey examines research on 3DGS from a variety of angles, including tasks, technology, opportunities, and problems.

Perspectives

Link	description
‘Google says I’m a dead physicist’: is the world’s biggest search engine broken?	For decades now, anyone who’s wanted to know everything about anything has asked Google. But is the platform losing its edge – and can we still trust it to tell us the truth?
AI paid for by Ads – the gpt-4o mini inflection point.	With the incredibly cheap prices of OpenAI's new GPT-4o micro model, AI-generated content monetized with advertisements may now be produced. Publishers can make a net profit of $0.002 for every page view by creating dynamic blog posts at $0.00051525 each and making about $0.0026 per ad impression. A possible consequence of this could be a move toward AI-generated content in response to user inquiries.
Using LLMs for Evaluation.	Large language models are becoming more and more capable, yet because of their varied functions, effectively evaluating them is still difficult. The gold standard is human evaluation, but it is expensive and time-consuming. Despite potential biases like positional and verbosity bias, which can be reduced by strategies like randomizing output positions and employing different evidence calibrations, using LLMs themselves as evaluators offers a scalable, cost-effective option.
Three Archetypes of AI Application Startups.	Three prominent patterns of AI applications are emerging: AI colleagues, which autonomously manage certain activities alongside human workers, AI Copilots which help with tasks, and AI-Native Services, which provide end-to-end services that combine AI with human input. Devin and GitHub Copilot are prime examples of AI Colleagues and Copilots who support engineering and coding, respectively. AI-Native Services, which include bookkeeping software like Pilot, rival traditional service providers by providing automated solutions in fields like accounting and legal.
Inside the fight over California’s new AI bill.	The Safe and Secure Innovation for Frontier Artificial Intelligence Models bill, introduced by California state Senator Scott Wiener, mandates that businesses that train "frontier models" that cost above $100 million conduct safety testing and have the capability to turn off their models in the event of a safety incident. The tech sector has strongly criticized the law. Not just businesses who create their models in California will be impacted, but everyone doing business in California. Wiener was interviewed for this piece regarding the bill and its detractors.
How fast can structured grammar generation be.	Quickly, the open-source community is tackling structured generation in language models.
Could robot weedkillers replace the need for pesticides?	The robotic services allow farmers to rely less on chemicals. ‘This solves a lot of problems,’ workers say
Open source is the path forward.	The importance of open source to Meta's strategy and its plans to support this work was explained by Mark Zuckerberg.
What Does Money Look Like In An AI Utopia?	Let’s assume that an AI utopia means nobody has to work anymore. What happens to money?
This is How Much Data Does AI Creates Every Minute.	About $300,000 is spent on AI every sixty seconds, 52 undergraduate papers are plagiarized by AI, and text-to-image algorithms produce close to 20,000 images.
ChatGPT for science: how to talk to your data.	Companies are using artificial intelligence tools to help scientists query their data without the need for programming skills.
The AI Dangers of a Second Trump Presidency.	Trump's influence may be seen in the Republican platform, which promises to undo Biden's executive order on responsible AI development. This is in contrast to the all-encompassing strategy of the current administration, which aims to preserve workers, promote innovation, and defend civil liberties against the potential negative effects of AI. Trump's policies, according to his detractors, might strengthen Big Tech at the price of social protections and individual liberties.
Small Teams, Big Impact: How AI Is Reshuffling The Future Of Work?	AI is changing the nature of work in the future by enabling more accessible AI capabilities, which will result in smaller, more productive teams and a rise in entrepreneurship. While hiring for AI capabilities is becoming more and more important for businesses, an open conversation about how AI will affect job displacement and the creation of new roles is necessary. AI adoption snags continue because of the need for substantial "handholding" because of inexperienced data or systems.
The all-seeing AI webcam.	On the infinite list of possible uses for AI, “getting selfie advice from a Kylie Jenner voice clone” seems both completely off-the-wall and also pretty inevitable. So of course it does exist. It’s not a widely available app, at least not yet; it’s an experiment from artist and programmer Dries Depoorter.
Building A Generative AI Platform.	After studying how companies deploy generative AI applications, I noticed many similarities in their platforms. This post outlines the common components of a generative AI platform, what they do, and how they are implemented. I try my best to keep the architecture general, but certain applications might deviate. This is what the overall architecture looks like.
Hold on to your seats: how much will AI affect the art of film-making?	The future is here, whether some like it or not, and artificial intelligence is already impacting the film industry. But just how far can, and should, it go?
Why Zuckerberg’s multibillion-dollar gamble doesn’t just matter to Meta.	As Llama 3.1 405B is made freely available, investors are asking when the huge industry spend will pay off

Back to index

ML news: Week 15 - 21 July

Research

Link	description
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs.	demonstrates how a Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks. It also introduces a new instruction fine-tuning framework to perform effective context ranking and answering generation to enhance an LLM's RAG capabilities. This framework makes use of a small ranking dataset to outperform existing expert ranking models.
Mixture of A Million Experts.	aims to decouple computational cost from parameter count by efficiently routing to a large number of tiny experts through a learned index structure used for routing. It shows superior efficiency compared to dense FFW, coarse-grained MoEs, and Product Key Memory (PKM) layers. introduces a parameter-efficient expert retrieval mechanism that uses the product key technique for sparse retrieval from a million tiny experts.
Reasoning in Large Language Models: A Geometric Perspective.	establishes a relationship between the expressive power of LLMs and the density of their self-attention graphs; their analysis shows that the density of these graphs defines the intrinsic dimension of the inputs to the MLP blocks. investigates the reasoning of LLMs from a geometrical perspective; reports that a higher intrinsic dimension implies greater expressive capacity of the LLM.
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps.	Contextual Hallucinations Mitigation in LLMs: This paper presents a novel approach that both detects and reduces contextual hallucinations in LLMs (e.g., reduces by 10% in the XSum summarization task). It does this by building a hallucination detection model based on input features provided by the ratio of attention weights on the context vs. newly generated tokens (for each attention head). The theory behind this approach is that contextual hallucinations are related to the degree to which an LLM attends to the contextual information provided. Additionally, they suggest a decoding strategy that mitigates contextual hallucinations based on their detection method, and this can be applied to other models without requiring retraining.
RouteLLM.	uses human preference data and data augmentation techniques in its training framework to improve performance and reduce costs by over two times in some cases, all while maintaining response quality. It suggests effective router models to dynamically choose between stronger and weaker LLMs during inference to achieve a balance between cost and performance.
Learning to (Learn at Test Time): RNNs with Expressive Hidden States.	suggests new layers for sequence modeling that have linear complexity and an expressive hidden state; defines a hidden state as an ML model that can update even when tested; a two-layer MLP-based hidden state combined with a linear model is found to match or outperform baseline models such as Mamba, Transformers, and contemporary RNNs; the linear model is faster than Mamba in wall-clock time and matches Transformer at 8k context.
Physicochemical graph neural network for learning protein-ligand interaction fingerprints from sequence data.	Predicting the binding affinity between small-molecule ligands and proteins is a key task in drug discovery; however, sequence-based methods are often less accurate than structure-based ones. Koh et al. develop a graph neural network using physicochemical constraints that discovers interactions between small molecules and proteins directly from sequence data and that can achieve state-of-the-art performance without the need for costly, experimental 3D structures.
Generic protein-ligand interaction scoring by integrating physical prior knowledge and data augmentation modeling.	Machine learning can improve scoring methods to evaluate protein-ligand interactions, but achieving good generalization is an outstanding challenge. Cao et al. introduce EquiScore, which is based on a graph neural network that integrates physical knowledge and is shown to have robust capabilities when applied to unseen protein targets.
MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis.	Semantic Vision-Language Integration Expert (SemVIE) is a feature of MARS, a novel text-to-image (T2I) generation system.
OpenDiLoCo.	Prime Intellect duplicated the DeepMind technique known as Distributed Low-Communication (DiLoCo). It preserves GPU consumption while enabling cross-datacenter training.
gpu.cpp.	A new lightweight and portable library for WebGPU-based low-level GPU computations has been launched by Answer AI. Writing cross-GPU kernels is possible with it, and portable instructions are provided.
ViTime: A Visual Intelligence-based Foundation Model for Time Series Forecasting.	Rather than using conventional numerical data fitting, the foundation model for time series forecasting (TSF) called ViTime makes use of visual intelligence.
Gradient Boosting Reinforcement Learning.	The benefits of Gradient Boosting Trees (GBT) are applied to reinforcement learning using Gradient-Boosting RL (GBRL).
SpreadsheetLLM: Encoding Spreadsheets for Large Language Models.	An excellent study explaining how to convert a spreadsheet into a suitable representation for a contemporary LLM. Q/A, formatting, and other data operations can be done using this.
LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models.	Label-focused A novel technique for out-of-distribution (OOD) detection in Vision-Language Models such as CLIP is Automated Prompt Tuning (LAPT).
Prover-Verifier Games improve legibility of language model outputs.	To enable a weak model to grade content reliably, OpenAI trained a strong model to produce more legible text. The company discovered that this improved overall readability generally.
Temporally Consistent Stereo Matching.	By guaranteeing temporal consistency, researchers present a novel technique for video stereo matching that improves depth estimation.
Patch-Level Training for Large Language Models.	To increase training efficiency for big language models, researchers suggest patch-level training.

News

Link	description
Elon Musk promises ‘battle in court’ over EU’s crackdown on X’s blue checks.	Regulators’ findings suggest social network breached Digital Services Act and could be fined 6% of global turnover
AI prompts can boost writers’ creativity but result in similar stories, study finds.	Ideas generated by ChatGPT can help writers who lack inherent flair but may mean there are fewer unique ideas
OpenAI is reportedly working on more advanced AI models capable of reasoning and ‘deep research’.	The secret project is code-named ‘Strawberry,’ according to a Reuters report.
Meet the AI Agent Engineer.	At his company, Sierra, Bret Taylor, the Chairman of the Board of OpenAI, has created a new position called Agent Engineer. One of the first people in the role recently wrote a blog post describing the Sierra team's view of agent engineering as a new field inside AI engineering.
OpenAI Revenue.	An estimated $3.4 billion in revenue for OpenAI comes from its ChatGPT services.
Taming the tail utilization of ads inference at Meta scale.	Meta's machine learning inference services saw a two-thirds decrease in failure rates, a 35% increase in computing efficiency, and a halving of p99 latency because to changes made in the tail utilization. With these improvements, Meta's ad delivery systems are guaranteed to be able to manage growing workloads without requiring more resources and to uphold service level agreements. Predictive scaling and managing the machine learning model lifetime with Meta's unified platform, IPnext, are examples of continuous improvement techniques.
Meta to reportedly launch largest Llama 3 model on July 23.	Meta Platforms will release its largest Llama 3 model on July 23, The Information reported on Friday, citing an employee of the company. The new model, boasting 405 billion parameters, will be multimodal and capable of understanding and generating both images and text.
Quora’s Poe now lets users create and share web apps.	Poe, Quora’s subscription-based, cross-platform aggregator for AI-powered chatbots like Anthropic’s Claude and OpenAI’s GPT-4o, has launched a feature called Previews that lets people create interactive apps directly in chats with chatbots.
Microsoft CTO Kevin Scott thinks LLM “scaling laws” will hold despite criticism.	Will LLMs keep improving if we throw more compute at them? OpenAI dealmaker thinks so.
OpenAI says there are 5 'levels' for AI to reach human intelligence — it's already almost at level 2.	The company shared a five-level system it developed to track its artificial general intelligence, or AGI, progress with employees this week, an OpenAI spokesperson told Bloomberg. The levels go from the currently available conversational AI to AI that can perform the same amount of work as an organization.
AI startup Hebbia raised $130M at a $700M valuation on $13 million of profitable revenue.	Hebbia, a startup that uses generative AI to search large documents and respond to large questions, has raised a $130 million Series B at a roughly $700 million valuation led by Andreessen Horowitz, with participation from Index Ventures, Google Ventures and Peter Thiel.
Pixel 9 Pro might come with 1-year of Gemini Advanced.	With less than a month until Made by Google 2024, the latest leak suggests that the Pixel 9 Pro will come with 1 year of Gemini Advanced.
Company Abandons Plans to Give AI Workers "Rights" and Add Them to Org Chart After Outcry From Human Employees.	Following its announcement that it would give AI algorithms "rights" and integrate them as "digital workers" with managers and performance evaluations in its product, the HR software provider Lattice encountered criticism.
Want to know how AI will affect government and politics? The bots have the answers.	Tony Blair’s powerful thinktank asked ChatGPT how AI might affect public sector jobs. Critics say the results were … wonky
Andrej Karpathy's new company.	A new AI startup with an emphasis on education, Eureka Labs aims to transform the way we acquire new knowledge.
Whistleblowers accuse OpenAI of ‘illegally restrictive’ NDAs.	Whistleblowers have accused OpenAI of placing illegal restrictions on how employees can communicate with government regulators, according to a letter obtained by The Washington Post.
Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI.	AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission.
SciCode: A Research Coding Benchmark Curated by Scientists.	The objective of coding models has always been HumanEval. It is essentially solved now. This benchmark is the next step forward in solving difficult science programming puzzles.
SmolLM - blazingly fast and remarkably powerful.	This blog post introduces SmolLM, a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset. It covers data curation, model evaluation, and usage.
Benchmarking results for vector databases.	Redis has released updated information on the best vector databases, measuring throughput and latency with the help of the industry-recognized Qdrant framework. Key findings include Redis achieving much higher queries per second and lower latency than Qdrant, Milvus, and Weaviate, and outperforming competitors by 62% for low-complexity datasets and by 21% for high-dimensional datasets.
Announcing the launch of Gray Swan.	A company specializing in creating tools to assist businesses in evaluating the risks associated with their AI systems and protecting their AI installations from inappropriate use is called Gray Swan AI.
Anthropic releases Claude app for Android.	Anthropic launched its Claude Android app on Tuesday to bring its AI chatbot to more users. This is Anthropic’s latest effort to convince users to ditch ChatGPT by making Claude available in more places.
AI tool can pinpoint dementia’s cause — from stroke to Alzheimer’s.	An algorithm that distinguishes among a host of underlying causes of dementia could be used for diagnosis in hospitals and clinics.
Portal needed for victims to report AI deep fakes, federal police union says.	Parliamentary inquiry told police forced to ‘cobble together’ laws to prosecute man who allegedly spread deep fake images of women
Meta Won't Offer Future Multimodal AI Models In The EU.	Due to legislative uncertainties, Meta will not be able to provide future multimodal AI models to consumers in the EU; however, Llama 3 will still be offered in text only.
Anthropic teams up with venture capital firm to kickstart $100M AI startup fund.	Recipients of six-digit investments aren’t required to use Claude
Anthropic doubles output token limit.	Anthropic has doubled the max output token limit for Claude 3.5 Sonnet from 4096 to 8192 in the Anthropic API.
AI-powered video creation for work.	An AI-powered video creation tool for the workplace, Google Vids is tightly integrated with the Workspace suite.
aiXplain Secures $6.5M pre-Series A to Universalize AI Agent Development.	Saudi Aramco's venture arm, Wa'ed Ventures, has announced a $6.5 million pre-series A fundraising round for aiXplain (a global top 10 firm by market cap).
Meta pulls plug on the release of advanced AI model in EU.	‘Unpredictable’ privacy regulations prompt the Facebook owner to scrap regional plans for multimodal Llama
Mistral NeMo.	A novel tokenizer was used to train the multilingual Mistral Nemo 12B model, which exhibits strong multilingual and English performance. Also supported are 128k contexts.
OpenAI is releasing a cheaper, smarter model.	OpenAI is releasing a lighter, cheaper model for developers to tinker with called GPT-4o Mini. It costs significantly less than full-sized models and is said to be more capable than GPT-3.5.
Cohere and Fujitsu Announce Strategic Partnership To Provide Japanese Enterprise AI Services.	Cohere and Fujitsu have partnered strategically to create and offer enterprise AI services that have the best Japanese language capabilities in the market. These services, which will provide private cloud deployments to businesses in highly regulated sectors including financial institutions, the public sector, and research and development units, will be developed with security and data privacy as their primary goals.
OpenAI And Broadcom Held Discussions About Producing An AI Chip.	OpenAI and Broadcom have discussed developing a new artificial intelligence server processor.
Flow Studio.	Flow Studio creates 3-minute films that are completely produced, with a believable story, dependable characters, and automatically synced sound effects and background music.
Slow recovery from IT outage begins as experts warn of future risks.	Fault in CrowdStrike caused airports, businesses and healthcare services to languish in ‘largest outage in history’

Resources

Link	description
A Survey on Mixture of Experts.	a survey study on the Mixture of Experts (MoE), covering its technical specifications, open-source implementations, assessment methods, and practical uses.
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence.	a new framework to address several limitations in multi-agent frameworks such as integrating diverse third-party agents and adaptability to dynamic task requirements; introduces an agent integration protocol, instant messaging architecture design, and dynamic mechanisms for effective collaboration among heterogeneous agents.
Meta 3D Gen.	a new pipeline that can generate 3D assets from text in less than a minute, from start to finish. It incorporates cutting-edge parts like TextureGen and AssetGen to represent objects in three dimensions: view space, volumetric space, and UV space. It also achieves a 68% win rate compared to the single-stage model.
Challenges, evaluation and opportunities for open-world learning.	Here we argue that designing machine intelligence that can operate in open worlds, including detecting, characterizing, and adapting to structurally unexpected environmental changes, is a critical goal on the path to building systems that can solve complex and relatively under-determined problems.
Machine learning-aided generative molecular design.	Data-driven generative methods have the potential to greatly facilitate molecular design tasks for drug design.
Introducing AuraFlow v0.1, an Open Exploration of Large Rectified Flow Models.	Fal trained a new open model called AuraFlow. The model has 5.8B parameters and was trained with muP.
Lynx: State-of-the-Art Open Source Hallucination Detection Model.	a model for identifying language model hallucinations that performs noticeably better than the state of the art in its generations.
Hyper-3DG: Text-to-3D Gaussian Generation via Hypergraph.	Hyper-3DG enhances text-to-3D model creation by emphasizing the intricate connections between texture and geometry.
LightenDiffusion.	By utilizing diffusion models and Retinex theory, LightenDiffusion enhances low-light photos.
ProDepth.	A novel framework for monocular depth estimation called ProDepth addresses problems brought on by moving objects in dynamic situations. It finds and fixes discrepancies in in-depth estimates using a probabilistic method.
Open-Canopy.	A high-resolution (1.5 m) publicly available dataset called Open-Canopy is used to estimate canopy height over France.
crawlee-python.	Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless modes. With proxy rotation.
Mathstral.	Mistral's newest math model performs well on various benchmarks
Codestral Mamba.	Codestral Mamba, a Mamba2 language model specialized in code generation, available under an Apache 2.0 license.
exo.	Run your own AI cluster at home on everyday devices.
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training.	Through addressing refusal position bias, a novel method called Decoupled Refusal Training (DeRTa) enhances safety tuning in large language models.
PID: Physics-Informed Diffusion Model for Infrared Image Generation.	By integrating physical laws into the conversion process, researchers have created a Physics-Informed Diffusion (PID) model that enhances the translation of RGB images to infrared images.
What happened to BERT & T5? On Transformer Encoders, PrefixLM, and Denoising Objectives.	Excellent post on encoders, prefixlm, denoising aims, and other contemporary language modeling techniques by Yi Tay of Reka and Google.
LiDAR Semantic Segmentation.	A novel technique called SFPNet is intended to be universal across various LiDAR technology types. Instead of employing window attention as in the past, SFPNet uses sparse focus point modulation to extract and dynamically collect multi-level contexts.
Praison AI.	Using prior agent frameworks as a springboard, Praison AI is a low-code, centralized framework with customizable features and human-agent interaction that makes it easier to create and manage multi-agent systems for a range of LLM applications.
Video Object Segmentation with World Knowledge.	Reasoning Video Object Segmentation (ReasonVOS) is a new task that uses implicit text queries to generate segmentation masks. It requires complex reasoning and world knowledge.
Enhancing Class Learning Without Forgetting.	In order to enhance Class-Incremental Semantic Segmentation (CISS), this project presents a background-class separation framework.
Leapfrogging traditional vector-based RAG with language maps.	When developing a chat application over data, retrieval plays a major role. But frequently, systems are delicate to the format of the data being accessed. Chat-based performance is greatly enhanced by creating a language map (e.g., Wikipedia-style entry) of the material and using that for retrieval. This is how code-based question answering is handled by mutable AI.
Removing Inappropriate Content from Diffusion Models.	Using a revolutionary technique called Reliable and Efficient Concept Erasure (RECE), improper content may be removed from diffusion models in only three seconds without requiring additional fine-tuning.
LLM2sh.	A command-line tool called LLM2sh uses LLMs to convert requests written in plain English into shell instructions.
GraphMuse.	GraphMuse is a Python Library for Graph Deep Learning on Symbolic Music. This library intends to address Graph Deep Learning techniques and models applied specifically to Music Scores.
E5-V: Universal Embeddings with Multimodal Large Language Models.	A novel framework called E5-V modifies Multimodal Large Language Models (MLLMs) to provide multimodal embeddings that are universal. With prompts, it bridges the gap between various input formats and achieves remarkable results in multimodal activities without the need for fine-tuning.
Strategizing Your Preparation for Machine Learning Interviews.	Interviews for machine learning might be difficult. You may greatly increase your chances by being aware of the range of machine learning positions and adjusting your preparation to fit particular job duties and specializations. To approach interviews with confidence, concentrate on learning the fundamentals, investigating technology unique to the organization, and regularly monitoring your progress.
Uncensor Any LLM With Abliteration.	For safety, llama models are heavily restricted, which reduces their versatility. Through the identification and elimination of the rejection mechanism, the "abliteration" technique uncensored them, enabling models to respond to all stimuli without requiring retraining.
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers.	SPIQA is a quality assurance dataset created to assist users in rapidly locating solutions within scientific research publications by deciphering intricate figures and tables.

Perspectives

Link	description
AI’s ‘Oppenheimer moment’: autonomous weapons enter the battlefield.	The military use of AI-enabled weapons is growing, and the industry that provides them is booming
Will generative AI transform robotics?	In the current wave of excitement about applying large vision–language models and generative AI to robotics, expectations are running high, but conquering real-world complexities remains challenging for robots.
Introducing: The Managed-Service-as-Software (M-SaS) Startup.	AI-driven, service-oriented firms are creating Managed-Service-as-Software (M-SaS) enterprises, which follow a new business model blueprint in building their businesses. Startups need to adopt a fundamentally different attitude to use AI instead of selling it. These firms start off labor-intensive with low gross margins and then use automation and artificial intelligence (AI) to progressively move to greater SaaS-like gross margins.
Could AIs become conscious? Right now, we have no way to tell.	With divergent opinions on whether developments in machine learning and neuromorphic computing can result in sentient computers, the discussion over artificial intelligence potentially gaining awareness is becoming more heated. The theory of Integrated Information holds that the current hardware limits make AI consciousness implausible, while computational functionalist theories such as Global Neuronal Workspace Theory and Attention Schema Theory believe that AI awareness is inevitable. Neuroscience is trying to come up with a single theory of consciousness in order to better understand how it might show up in AI.
Generative AI makes for better scientific writing — but beware the pitfalls.	As researchers who have sometimes struggled with articulating intricate concepts, we find his suggestions for using ChatGPT to improve the clarity and coherence of academic papers compelling. But potential pitfalls warrant further discussion.
My trip to the frontier of AI education.	First Avenue Elementary School in Newark is utilizing Khanmigo, an AI-powered tutor and teacher assistant created by Khan Academy, to include AI tools for education. Teachers in the classroom can customize instruction and cut down on work time by using this technology. The goal of increasing responsiveness and inclusion is a continuous endeavor. Through increased teacher-student involvement, this Gates Foundation-backed project seeks to level the playing field in education.
AI-Driven Behavior Change Could Transform Health Care.	Thrive AI Health is being funded by OpenAI and Thrive Global to create a customized AI health coach that addresses everyday health-related behaviors like nutrition and sleep. AI's hyper-personalization powers the mobile app and corporate solution by fusing individual data with peer-reviewed science. The project intends to manage chronic diseases, democratize healthy behavior modification, and show how effectively AI can be integrated into healthcare while maintaining robust privacy protections.
GraphRAG Analysis, Part 1: How Indexing Elevates Knowledge Graph Performance in RAG.	Analysis of Microsoft's GraphRAG research suggests that knowledge graphs like Neo4j may not significantly beat FAISS in context retrieval for RAG applications. While Neo4j without its indexing can reach a better answer relevancy, the minor advantages may not justify the cost given ROI limits. Neo4j's indexing, on the other hand, significantly improves answer faithfulness, lowering the possibility of false information.
How Taiwan secured semiconductor supremacy – and why it won’t give it up.	Trump has accused Taiwan of ‘taking’ the US chip sector, but Taipei has been at the forefront of the industry for decades, and its future could depend on it
Overcoming The Limits Of Current LLMs.	Large language models (LLM) have been all the rage for quite some time now. Looking beyond the hype though, they have severe limitations: hallucinations, lack of confidence estimates, and lack of citations.

Back to index

ML news: Week 8 - 14 July

Research

Link	description
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases.	Comprehensive and fascinating work by Meta that demonstrates how to train tiny models to maximize performance.
Non-Adversarial Learning: Vector-Quantized Common Latent Space for Multi-Sequence MRI.	Without the need for paired samples, researchers have created a new generative model to enhance MRI image translation between various sequences.
Free-SurGS: SfM-Free 3D Gaussian Splatting for Surgical Scene Reconstruction.	A new approach to 3D reconstruction of surgical scenes that do not require SfM has been presented. It overcomes the drawbacks of earlier methods that had trouble with inconsistent photometry and sparse textures.
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs.	Extremely powerful models for audio understanding and generation were provided by the Tongyi speech team.
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets.	A dataset with 60K entries is also released to aid in research on function-calling-enabled agents. APIGen - presents an automated data generation pipeline to synthesize high-quality datasets for function-calling applications; demonstrates that 7B models trained on curated datasets outperform GPT-4 models and other state-of-the-art models on the Berkeley Function-Calling Benchmark.
Searching for Best Practices in Retrieval-Augmented Generation.	Looking for Best Practices in RAG outlines best practices for creating efficient RAG workflows and suggests performance- and efficiency-focused tactics, such as newly developed multimodal retrieval tools.
Self-Evaluation as a Defense Against Adversarial Attacks on LLMs.	The article "Self-Evaluation as a Defense Against Adversarial Attacks on LLMs" suggests using self-evaluation as a defense against adversarial attacks. It demonstrates that developing a dedicated evaluator can significantly lower the success rate of attacks and uses a pre-trained LLM to build a defense that is more effective than fine-tuned models, dedicated safety LLMs, and enterprise moderation APIs. The article evaluates various settings, such as attacks on the generator alone and the generator + evaluator combined.
Adaptable Logical Control for Large Language Models.	The Ctrl-G framework, which combines LLMs and Hidden Markow Models to enable the following logical constraints (represented as deterministic finite automata), is presented in Adaptable Logical Control for LLMs. Ctrl-G achieves over 30% higher satisfaction rate in human evaluation compared to GPT4.
LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives.	In LLM See, LLM Do, the effectiveness and effects of synthetic data are examined in detail, along with how they affect a model's internal biases, calibration, attributes, and preferences. It is discovered that LLMs are sensitive to certain attributes even when the prompts from the synthetic data seem neutral, indicating that it is possible to influence the generation profiles of models to reflect desirable attributes.
Chinese developers scramble as OpenAI blocks access in China.	US firm’s move, amid Beijing-Washington tensions, sparks rush to lure users to homegrown models
PartCraft: Crafting Creative Objects by Parts.	PartCraft is a novel approach in generative visual AI that goes beyond conventional text- or sketch-based methods by enabling users to choose visual concepts by parts.
AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents.	AriGraph is a new technique that assists AI agents in creating a memory graph that incorporates episodic and semantic memories.
Researchers leverage shadows to model 3D scenes, including objects blocked from view.	Researchers at MIT and Meta developed PlatoNeRF, an AI method that builds 3D representations of scenes, including blocked areas, using single-photon lidar and shadows. This technique could improve AR/VR experiences and increase the safety of autonomous vehicles. With lower-resolution sensors, PlatoNeRF performs better than conventional techniques and shows promise for real-world applications.
Distilling System 2 into System 1.	Models classified as System 2 employ techniques similar to Chain of Thought in order to increase test time, compute, and enhance thinking. It turns out that this behavior can be reduced to a speedier, similarly accurate System 1 model.
Learning to (Learn at Test Time): RNNs with Expressive Hidden States.	a recently developed RNN variation that beats Mamba in several tasks. Significantly, extended contexts and in-context learning are made possible by the update function, which is an ML model in and of itself.
NuminaMath 7B TIR: Open Math Olympiad Model Released.	NuminaMath is a series of language models that are trained to solve math problems using tool-integrated reasoning (TIR).
4D Contrastive Superflows are Dense 3D Representation Learners.	SuperFlow is a novel system that uses successive LiDAR-camera pairs for spatiotemporal pretraining to improve 3D vision in autonomous driving.
PaliGemma: A versatile 3B VLM for transfer.	Based on Gemma 2B and SigLIP, PaliGemma is a powerful vision language model. Many of the choices taken in terms of architecture and data collecting are displayed in this technical paper.
ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction.	A novel job called Unsupervised Concept Extraction (UCE) collects and reconstructs many concepts from a single image without the need for human annotations.
Lookback Lens.	A simple model called Lookback Lens can be used to identify contextual hallucinations in large language models.

News

Link	description
A Hacker Stole OpenAI Secrets, Raising Fears That China Could, Too.	A security breach at the maker of ChatGPT last year revealed internal discussions among researchers and other employees, but not the code behind OpenAI’s systems.
Figma pulls AI tool after criticism that it ripped off Apple’s design.	Figma says it didn’t train the generative AI models it used and blames a ‘bespoke design system.’
Hollywood stars’ estates agree to the use of their voices with AI.	Earlier this week, AI company ElevenLabs said it is bringing digitally produced celebrity voice-overs of deceased actors, including Garland, James Dean and Burt Reynolds, to its newly launched Reader app. The company said the app takes articles, PDF, ePub, newsletters, e-books, or any other text on your phone and turns it into voice-overs.
Smart Paste for context-aware adjustments to pasted code.	We present Smart Paste, an internal tool that streamlines the code authoring workflow by automating adjustments to pasted code. We describe key insights from our UX and model preparation efforts, which have led to high performance and successful adoption among Google developers.
Apple M5 Chip's Dual-Use Design Will Power Future Macs and AI Servers.	Apple will reportedly use a more advanced SoIC packaging technology for its M5 chips, as part of a two-pronged strategy to meet its growing need for silicon that can power consumer Macs and enhance the performance of its data centers and future AI tools that rely on the cloud.
Apple Intelligence and a better Siri may be coming to iPhones this spring.	Expect Apple’s AI system in iOS 18.4, says a new Bloomberg rumor.
Meta claims news is not an antidote to misinformation on its platforms.	Company says it has ‘never thought about news’ as a way to counter misleading content on Facebook and Instagram despite evidence to the contrary
Meta drops AI bombshell: Multi-token prediction models now open for research.	Meta has thrown down the gauntlet in the race for more efficient artificial intelligence. The tech giant released pre-trained models on Wednesday that leverage a novel multi-token prediction approach, potentially changing how large language models (LLMs) are developed and deployed.
Google DeepMind’s AI Rat Brains Could Make Robots Scurry Like the Real Thing.	In order to investigate the brain circuits underlying complicated motor skills, DeepMind and Harvard University created a virtual rat using artificial intelligence (AI) neural networks trained on real rat motions and neural patterns. With its ability to transfer acquired movement skills to other settings, this bio-inspired AI could advance robotics and provide new insights into brain function. The study shows that brain activity associated with various behaviors may be accurately mimicked and decoded by digital simulations.
Microsoft drops observer seat on OpenAI board amid regulator scrutiny.	Startup’s new approach means Apple will no longer be able to appoint an executive to similar role
xAI ends deal with Oracle, builds own AI datacente.	Oracle has terminated xAI's agreement. After Grok 2 training is completed, it will construct its own data center. Originally, the corporation had a deal with Oracle for 24k H100s.
a16z is trying to keep AI alive with Oxygen initiative.	According to The Information, VC firm Andreessen Horowitz has secured thousands of AI chips, including Nvidia H100 GPUs, to dole out to its AI portfolio companies in exchange for equity.
Quora’s Poe now lets users create and share web apps.	Poe, Quora’s subscription-based, cross-platform aggregator for AI-powered chatbots like Anthropic’s Claude and OpenAI’s GPT-4o, has launched a feature called Previews that lets people create interactive apps directly in chats with chatbots.
Ex-Meta scientists debut gigantic AI protein design model.	EvolutionaryScale’s protein language model — among the largest AI models in biology — has created new fluorescent proteins and won big investment.
Anthropic’s Claude adds a prompt playground to quickly improve your AI apps.	Prompt engineering became a hot job last year in the AI industry, but it seems Anthropic is now developing tools to at least partially automate it.
OpenAI and Los Alamos National Laboratory announce bioscience research partnership.	OpenAI and Los Alamos National Laboratory are developing evaluations to understand how multimodal AI models can be used safely by scientists in laboratory settings.
‘I am happy to see how my baby is bouncing’: the AI transforming pregnancy scans in Africa.	While ultrasound services are normal practice in many countries, software being tested in Uganda will allow a scan without the need for specialists, providing an incentive for pregnant women to visit health services early on
Samsung to launch upgraded voice assistant Bixby this year with its own AI.	Samsung will launch an upgraded version of its voice assistant Bixby this year based on its own artificial intelligence models, mobile chief TM Roh told CNBC.
Google says Gemini AI is making its robots smarter.	DeepMind is using video tours and Gemini 1.5 Pro to train robots to navigate and complete tasks.
Here’s how Qualcomm’s new laptop chips really stack up to Apple, Intel, and AMD.	The Snapdragon X Elite and X Plus chips from Qualcomm are making Windows on Arm a competitive platform, roughly matching the performance and battery life of AMD Ryzen, Apple's M3 chip, and Intel Core Ultra. The Snapdragon chips are excellent in multi-core scores and power economy, even though they don't lead in GPU performance. The latest generation of laptops with Snapdragon processors is a more affordable option than MacBooks and conventional Intel or AMD-based devices.
China's Laws of Robotics: Shanghai publishes first humanoid robot guidelines.	Shanghai has published China's first governance guidelines for humanoid robots, calling for risk controls and international collaboration, as tech giants like Tesla showed off their own automatons at the country's largest artificial intelligence (AI) conference.
Crowdsourced Decentralized AI Market Map.	Open sourcing a community-led market map of Decentralized AI

Resources

Link	description
CapPa: Training vision models as captioners.	Craiyon's trained CapPa vision model achieves state-of-the-art results on several difficult vision benchmarks.
Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis.	Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and proprietary models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters.
EGIInet: Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion.	By means of geometric task guiding, EGIInet successfully combines two modalities to present a novel way to point cloud completion.
Quality Prompts.	QualityPrompts implements 58 prompting techniques explained in this survey from OpenAI, Microsoft, et al.
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems.	Describes a new job, SummHay, to evaluate a model's capacity to process a Haystack and produce a summary that highlights the key insights and references the original documents; finds that RAG components are found to improve performance on the benchmark, making it a feasible choice for holistic RAG evaluation. Long-context LLMs score 20% on the benchmark, which lags the human performance estimate of 56%.
AI Agents That Matter.	AI Agents That Matter examines existing agent evaluation procedures and identifies flaws that could prevent practical deployment; it also suggests a framework to prevent overfitting agents and an implementation that simultaneously maximizes accuracy and cost.
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2.	A post by Neel Nanda, a Research Engineer at Google DeepMind, about his favorite papers to read in Mechanistic Interpretability.
SAE.	This library trains k-sparse autoencoders (SAEs) on the residual stream activations of HuggingFace language models, roughly following the recipe detailed in Scaling and evaluating sparse autoencoders (Gao et al. 2024)
MInference.	To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention, which reduces inference latency by up to 10x for pre-filling on an A100 while maintaining accuracy.
micro-agent.	An AI agent that writes and fixes code for you.
AnySR.	A novel method for improving efficiency and scalability in single-image super-resolution (SISR) is called AnySR. The 'Any-Scale, Any-Resource' implementation is supported by AnySR, in contrast to previous techniques, which reduces resource requirements at smaller scales without the need for extra parameters.
Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos.	Without human supervision, researchers have created a novel method for estimating category-level 3D poses from informal, object-centric films.
SenseVoice .	a speech foundation model that possesses a variety of speech understanding functions, such as auditory event detection, spoken language identification, automatic speech recognition, and speech emotion recognition.
Boosting Large Vision Language Models with Self-Training.	A novel method called Video Self-Training with Augmented Reasoning (Video-STaR) aims to enhance Large Vision Language Models (LVLMs).
GraphRAG.	With GraphRAG, you may use language models to analyze unstructured text. The quick start is simple to spin up because it operates on Azure.
iLLM-TSC.	To enhance traffic signal control systems, researchers have created a novel framework that blends reinforcement learning with a sizable language model.
Tutorials on Tinygrad.	A set of tools called Tinygrad is used to train deep learning models. An in-depth look at Tinygrad internals is made possible by this set of notes, which serves as an excellent introduction for AI compilers.
OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving.	A 4D occupancy generation model based on diffusion called OccSora is intended to enhance long-term temporal evolutions.
Awesome AGI Survey.	The goal of Artificial General Intelligence (AGI) is to execute a variety of real-world jobs with human-like efficiency. This project explores the path towards AGI.
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation.	Developed from Meta's Chameleon model, Anole is an open autoregressive multimodal model. With focused fine-tuning, this effort restores the model's ability to generate images.
Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning.	A novel reinforcement learning framework is presented by researchers to enhance customized text-to-image generation.
PerlDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models.	PerlDiff is a technique that incorporates 3D geometric information to increase the accuracy of street view image production.
Paints-Undo.	Paints UNDO is a system where a model generates strokes that are used to reconstruct an image. It comes from the same creators as ControlNet, IC-Light, and many other image production systems. Remarkably, in contrast to earlier stroke systems, this model is able to cancel strokes and frequently completely reevaluates its strategy halfway through—quite like a human artist would.
minRF.	For Stable Diffusion 3, scalable rectified flow transformers are partially utilized. This repository contains sweeps of the muP hyperparameters along with a rudimentary implementation of them.
RouteLLM.	RouteLLM is a framework for serving and evaluating LLM routers
30x speedup in model init for HF Transformers.	If you move some lazy loading to the model on the first pass, you can significantly reduce the amount of tokens lost every second during model initialization.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision.	The basis for contemporary fast language models is FlashAttention. Up from 35% previously, this new variant takes 75% of the H100 capacity. This capability gain is the result of several significant system enhancements.
OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion.	A novel approach to open-vocabulary detection called OV-DINO addresses the difficulties of combining various data sources and making use of language-aware capabilities.
Open-Vocabulary Video Instance Segmentation.	A innovative approach to Open-Vocabulary Video Instance Segmentation (VIS), OVFormer tackles important problems in the field. It uses video-based training to increase temporal consistency and align embeddings better.
Satellite Image Time Series Semantic Change Detection: Novel Architecture and Analysis of Domain Shift.	This work integrates semantic segmentation and change detection to address semantic change detection using satellite image time series (SITS-SCD).
PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer.	The PosFormer model overcomes the drawbacks of sequence-based methods to greatly enhance Handwritten Mathematical Expression Recognition (HMER).

Perspectives

Link	description
Real criminals, fake victims: how chatbots are being deployed in the global fight against phone scammers.	New scambaiting AI technology Apate aims to keep scammers on the line while collecting data that could help disrupt their business model
James Muldoon, Mark Graham, and Callum Cant: ‘AI feeds off the work of human beings’.	The Fairwork trio talk about their new book on the ‘extraction machine’, exposing the repetitive labor, often in terrible conditions, that big tech is using to create artificial intelligence
Superintelligence—10 years later.	Ten years after the publication of Nick Bostrom's seminal book "Superintelligence," advances in AI have raised awareness of the potential for AGI and its associated concerns. With 2024 being a turning point toward guaranteeing control and alignment with human values, the AI research community is now giving AI safety serious attention. With AI technologies advancing so quickly, the sector faces concerns related to safety and ethics that were previously thought to be theoretical.
How Good Is ChatGPT at Coding, Really?	Depending on the task difficulty and programming language, OpenAI's ChatGPT may generate code with success rates anywhere from less than 1% to 89%.
TechScape: Can AI really help fix a healthcare system in crisis?	Artificial intelligence is heralded as helping the NHS fight cancer. But some warn it’s a distraction from more urgent challenges
Pop Culture.	In a critical 31-page analysis titled "Gen AI: Too Much Spend, Too Little Benefit?", Goldman Sachs makes the case that utility spending would rise sharply due to generative AI's power consumption and very little productivity advantages and returns. The study raises concerns about AI's potential to completely change industries by highlighting its high price, problems with the electrical infrastructure, and inability to produce appreciable increases in productivity or revenue. If significant advancements in technology are not made, it could portend a dismal future for the field.
The AI summer.	Compared to other tech innovations like the iPhone and e-commerce, which took years to acquire hold, ChatGPT's quick adoption—it hit 100 million users in just two months—is noteworthy. Even with the initial excitement, not many users have found ChatGPT to be useful in the long run, and business adoption of big language models is still few. This suggests that more work is necessary to establish substantial product-market fit and long-term value.
A Deep Dive on AI Inference Startups.	The development of AI's "picks and shovels," such as model fine-tuning, observability, and inference, is a well-liked field for venture capital investment. VCs are placing bets that when businesses integrate AI into their products, they won't want to develop things themselves. For AI inference, the TAM is highly limited. For VCs' investments to be profitable, they must have faith in significant TAM expansion. Although platforms for AI inference benefit startups in the short run, over the long run, they hurt them.
Cyclists can't decide whether to fear or love self-driving cars.	San Francisco cyclists have reported near misses and safety concerns with self-driving cars from Waymo and Cruise. Almost 200 complaints about these self-driving cars' unpredictable behavior and near-misses have been filed with the California DMV. Despite the manufacturers' claims that their cars had improved safety features, the events cast doubt on the vehicles' suitability for widespread use in the face of heightened regulatory scrutiny.
Augmenting Intelligence.	This essay promotes a practical approach to employing AI as an enhancement to human intelligence and explores bridging the divide between techno-optimists and pessimists on the subject. It discusses AI's role in education, its effects on creativity and the arts, and its ethical application. The paper highlights that artificial intelligence (AI) is a tool that augments human capabilities rather than poses a threat, suggesting that the term "augmented intelligence" is a more realistic description.

Back to index

ML news: Week 1 - 7 July

Research

Link	description
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs.	claims to achieve 64.3% on HotpotQA (full-wiki), which is on par with the state-of-the-art model. proposes LongRAG, which combines RAG with long-context LLMs to enhance performance; uses a long retriever to significantly reduce the number of extracted units by operating on longer retrieval units; the long reader takes in the long retrieval units and leverages the zero-shot answer extraction capability of long-context LLMs to improve performance of the overall system.
From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data.	suggests a fine-tuning strategy to increase the precision of information retrieval in LLMs while preserving reasoning abilities over long-context inputs; the fine-tuning dataset consists of 350 sample numerical dictionary key-value retrieval tasks; results show that this strategy reduces the "lost-in-the-middle" effect and enhances performance on both long-context reasoning and information retrieval.
GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models.	enhances the long-context capabilities of LLMs by proposing a graph-based agent system that organizes long text into a graph and uses an agent to explore the graph (using predefined functions guided by a step-by-step rational plan) to efficiently generate answers to questions; consistently outperforms GPT-4-128k across context lengths ranging from 16k to 256k.
Following Length Constraints in Instructions.	explains a method for addressing length bias and training language models that adhere to length constraints more closely; it refines a model using DPO using a dataset that has been augmented with length instructions and demonstrates fewer length constraint violations while maintaining a high response quality.
Adam-mini: Use Fewer Learning Rates To Gain More.	a new optimizer that carefully divides parameters into blocks and assigns a single high-quality learning that outperforms Adam; it achieves consistent results on language models sized from 125M -7B for pre-training, SFT, and RLHF. It uses fewer learning rates, which results in a 45%–50% reduction in memory footprint while still performing on par or even better than AdamW.
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data.	generative image model with better performance than pure text conditioned models due to its ability to interleave text and images.
Scaling Synthetic Data Creation with 1,000,000,000 Personas.	By treating web text as originating from a persona, this approach can significantly enhance job performance downstream by conditioning on that persona. The researchers find a jump of 20% points on MATH.
Odd-One-Out: Anomaly Detection by Comparing with Neighbors.	A novel anomaly detection challenge has been presented by researchers that focus on things that appear unusual in comparison to other objects in the scene. In contrast to conventional techniques, anomalies in this case are distinctive to the scene and can be determined from several angles.
Adaptable Logical Control for Large Language Models.	This approach enables the control of model generation at inference time, as well as interactive text editing. It achieves strong performance with tiny models and permits logical limitations in the generating process.
Pairwise Difference Learning for Classification.	Scholars have expanded Pairwise Difference Learning (PDL), which was first developed as a regression method, to include classification tasks. PDL makes predictions about the differences between pairs of instances rather than the outcomes themselves.
AXIAL.	This research improves the explainability of model decisions by putting forth a novel technique for identifying Alzheimer's disease using 3D MRI scans.
Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization.	A novel technique called Multi-Session SLAM creatively records camera movements throughout multiple disconnected video sequences using a single global frame of reference.

News

Link	description
An Update to Adept.	The founders of Adept are heading to Amazon to license some of their technology.
Time strikes a deal to funnel 101 years of journalism into OpenAI's gaping maw.	Time has joined a growing number of publications to sign a licensing deal with OpenAI. The ChatGPT creator will legally be able to train its large language models on 101 years' worth of the storied publication's journalism, as Axios first reported.
Amazon Investigates Perplexity AI Over Potential Data-Scraping Violations.	Amazon Web Services is looking into whether Perplexity is breaking its rules after Wired said the AI startup is swiping its web archives without consent. Perplexity, however, says it's following the rules.
Apple could announce a Google Gemini deal this fall .	If you’re disappointed that the only AI model that will integrate with Apple devices so far will be ChatGPT, it sounds like you won’t have to wait long for that to change. Apple will announce “at least” one other deal — to add Google Gemini, too — this fall.
Meta accused of breaking EU digital law by charging for ad-free social networks.	European Commission objects to ‘pay or consent’ model for users of Facebook and Instagram
Microsoft’s Mustafa Suleyman says he loves Sam Altman, believes he’s sincere about AI safety.	In an interview at the Aspen Ideas Festival on Tuesday, Mustafa Suleyman, CEO of Microsoft AI, made it very clear that he admires OpenAI CEO Sam Altman.
When the Terms of Service Change to Make Way for A.I. Training.	As they negotiate a complicated web of privacy regulations and user consent, tech giants like Google and Meta are revising their privacy rules to allow the use of public and potentially private user data to train AI systems. There has been a backlash since consumers and content creators are afraid that their work will be used to train AI that may eventually replace them. The conflicts draw attention to new issues in data privacy, AI development, and striking a balance between innovation and morality in the IT sector.
Meet Figma AI.	Designers may get assistance with tasks like visual search, asset search, text editing, image editing, prototyping, layer renaming, and design generation with Figma AI, a new suite of AI-powered capabilities for Figma. During the beta phase, these features—which are driven by AI models from third parties—are free to use.
Google’s emissions climb nearly 50% in five years due to AI energy demand.	Tech giant’s goal of reducing climate footprint at risk as it grows increasingly reliant on energy-hungry data centers
Amazon beefs up AI development, hiring execs from startup Adept and licensing its technology.	Amazon has hired top executives from AI agent startup Adept, the company confirmed. As part of the deal, Amazon will license technology from Adept, including some of its AI models and datasets. Amazon has been trying to keep pace with competitors in AI by developing services and through its investment in OpenAI competitor Anthropic.
YouTube now lets you request removal of AI-generated content that simulates your face or voice.	YouTube also quietly rolled out a policy change in June that will allow people to request the takedown of AI-generated or other synthetic content that simulates their face or voice. The change allows people to request the removal of this type of AI content under YouTube’s privacy request process.
Phil Schiller to join OpenAI board in ‘observer’ role following Apple’s ChatGPT deal.	At WWDC last month, Apple announced its partnership with OpenAI to integrate ChatGPT into iOS 18. While no money is changing hands between Apple and OpenAI, a new report today reveals that Apple will get an “observer role” on OpenAI’s board of directors as part of the arrangement.
Japan introduces enormous humanoid robot to maintain train lines.	The 12-metre high machine has coke bottle eyes and a crude Wall-E-like head, as well as large arms that can be fitted with blades or paint brushes
Elon Musk: Grok 2 AI Arrives in August.	Musk says Grok 2 'should exceed current AI on all metrics,' though Grok 3 is waiting in the wings.
Nvidia CEO Jensen Huang addresses rising competition at shareholder meeting after historic stock surge.	Nvidia CEO Jensen Huang answered questions at the company’s annual shareholder meeting after a more than 200% surge in the stock over the past year. The company passed a $3 trillion valuation and was briefly the most valuable public company. Without naming competitors, Huang laid out the company’s overall strategy to maintain its position.
Persona’s founders are certain the world can use another humanoid robot.	MIT research scientist Jerry Pratt is back at it. In 2022, he left Boardwalk Robotics, a humanoid startup he founded and led, and joined the well-funded ranks of the Bay Area-based robotics firm Figure as its CTO months before it exited stealth. But he and Figure quietly parted ways last month.
Kyutai unveils today the very first voice-enabled AI openly accessible to all.	A pure audio LLM with low latency has been trained by Kyutai, an open research lab in France. In the upcoming months, the very amazing demo that it has managed to produce will be made available for public use.
Face screening tool detects stroke in seconds.	A new smartphone face-screening tool could help paramedics to identify stroke in seconds – much sooner and more accurately than is possible with current technologies.
This is Big Tech’s playbook for swallowing the AI industry.	With Amazon’s hiring of the team behind a buzzy AI startup, a pattern is emerging: the reverse acquihire.
Intel shows off first fully integrated optical compute interconnect, designed to scale up AI workloads.	Intel Corp. said today it has achieved another key milestone as it strives to make integrated photonics technology for high-speed data transfers a reality.
OpenAI’s ChatGPT Mac app was storing conversations in plain text.	After the security flaw was spotted, OpenAI updated its desktop ChatGPT app to encrypt the locally stored records.
Jeff Bezos to sell $5bn of Amazon shares after stock hits record high.	Proposed sale of 25m shares disclosed in a notice on Tuesday after the stock hit an all-time high of $200.43 during session
Wimbledon employs AI to protect players from online abuse.	Threat Matrix service monitors social media profiles and flags up death threats, racism and sexist comments

Resources

Link	description
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees.	improves the long-context capabilities of LLMs by putting forth a graph-based agent system that efficiently generates answers to questions by organizing long text into a graph and employing an agent to explore the graph (using predefined functions guided by a step-by-step reasonable plan); surpasses GPT-4-128k with consistency in context lengths between 16k and 256k.
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey.	survey on LLM-based synthetic data generation, curation, and evaluation.
Text2Bricks: Fine-tuning Open-Sora in 1,000 GPU Hours.	Lambda Labs trained the Open Sora video model on its 1-click cluster to create Lego movies.
Laplace Neural Operator.	One architecture for approximating PDEs that is based on neural networks is the Laplace operator.
llama-agents.	llama-agents is an async-first framework for building, iterating, and productionizing multi-agent systems, including multi-agent communication, distributed tool execution, human-in-the-loop, and more!
Suri: Multi-constraint Instruction Following for Long-form Text Generation.	A collection of 20,000 lengthy documents and intricate instructions is called Suri. Its goal is to enhance AI's capacity to adhere to intricate writing requirements. The Suri development team has presented Instructional ORPO (I-ORPO), an alignment technique that provides feedback through artificially damaged instructions.
Cambrian-1.	High-performing, fully open vision model from NYU with significant improvements over text encoders and data mixtures.
DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability.	A novel expressive text-to-speech (TTS) model called DEX-TTS makes use of reference speech to enhance style representation and model generalization.
Debugging in PyTorch.	PyTorch is an excellent modeling tool. Nonetheless, a few prevalent issues have the ability to significantly lower model performance. Examining this list will aid you when debugging your model code.
vision-agent.	Vision Agent is a library that helps you utilize agent frameworks to generate code to solve your vision task.
What to do to scale up?	An amazing and surprisingly understandable post about fine-tuning hyperparameters as model and dataset sizes increase.
Web2Code.	A novel procedure that researchers have created will enhance Web2Code instruction tweaking. It entails generating new text question-answer pairs, generating new webpage image-code pairs, improving webpage understanding data, and developing new webpage code generation pairs.
Block Transformer: Global-to-Local Language Modeling for Fast Inference.	This repository presents a brand-new Transformer type with a significantly smaller KV cache size. Although it hasn't been tested in large quantities, it should be able to perform on par with typical Transformers.
Composio.	Equip your agent with high-quality tools & integrations without worrying about authentication, accuracy, and reliability in a single line of code!
Segment Anything without Supervision.	Unsupervised SAM (UnSAM) is a 'segment anything' model for promptable and automatic whole-image segmentation which does not require human annotations.
Following Length Constraints in Instructions.	Most models don't adhere to length specifications (less than 40 words, for example). This piece demonstrates how to tune them to do that.
AI Overviews Research: Comparing pre and post-rollout results on 100K keywords.	The prevalence of Google's AI Overviews (AIO) feature, which typically links to the top 10 organic results, has significantly decreased from 64% pre-rollout to just 8.71% of SERPs for 100K keywords. Following the implementation, both the length of AIO material and the number of links have grown, demonstrating Google's focus on thorough responses and reliable sources. In this dynamic search environment, where user searches with longer inquiries, lower search volumes, and lower CPC are more likely to result in AI-generated results, SEO techniques must change to stay relevant.
Meta 3D Gen.	Meta has trained both a PBR texture creation system and an advanced 3D object generation model. It generates synthetic data by using the proprietary 2D picture-generating model of the company.
Mutahunter.	An open-source, LLM-based mutation testing tool for automated software testing that is independent of language.
LLaRA: Large Language and Robotics Assistant.	LLaRA is a framework that leverages conversation-style instruction-response pairings and Large Language Models (LLMs) to enhance robot action policy. These Vision Language Models (VLMs) use visual inputs to evaluate state data and produce the best possible policy choices.
MM-Instruct.	MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
Parable of the Parser.	Great keynote talk from CVPR.
InstantStyle-Plus : Style Transfer with Content-Preserving in Text-to-Image Generation.	Style transfer with modern diffusion models and content embedders.
RSCaMa: Remote Sensing Image Change Captioning with State Space Model.	A novel technique called RSCaMa has been presented by researchers to use natural language to describe changes in remote sensing photographs.
Simple Diffusion Language Models.	Excellent talk about utilizing diffusion as a target for language modeling by Hugging Face researcher and Cornell Tech professor Sasha Rush.
3D Reconstruction from Blurry Images.	Researchers have created a technique that uses neural radiance fields (NeRF) and event streams to recreate three-dimensional sceneries from a single fuzzy image. This novel method eliminates the requirement for pre-computed camera poses by modeling camera motion and synthesizing brightness changes to produce high-quality, view-consistent images from hazy inputs.
Agentless.	Agentless is an agentless approach to automatically solve software development problems. To solve each issue, Agentless follows a simple two-phase process: localization and repair.
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention.	A novel technique called inference speeds up the processing of lengthy cues in big language models. To get around the considerable delays brought on by conventional approaches, it makes use of sparse computation techniques.
torch.compile, the missing manual.	Manual for resolving torch.compile errors to make your code run faster.
facebook/multi-token-prediction.	Models for Meta's multi-token prediction model were provided, and they performed incredibly well.
Maestro - A Framework for Claude Opus, GPT and local LLMs to Orchestrate Subagents.	This Python script demonstrates an AI-assisted task breakdown and execution workflow using the Anthropic API. It utilizes two AI models, Opus and Haiku, to break down an objective into sub-tasks, execute each sub-task, and refine the results into a cohesive final output.
Magic Insert: Style-Aware Drag-and-Drop.	Method from Google to introduce meaningful items into photos with diffusion. The demo and dataset are accessible.
Discrete Semantic Tokenization for Deep CTR Prediction.	UIST is a unique method that transforms dense embeddings into discrete, compact tokens for user and item representations, therefore significantly improving click-through rate estimates.
CELLO: Causal Evaluation of Large Vision-Language Models.	With 14,094 causal questions, CELLO is a new dataset designed to help AI understand causality beyond common sense thinking.
OpenStreetView-5M.	With more than 5 million geotagged street photos from 225 countries, OpenStreetView-5M is a sizable open-access dataset aimed at evaluating computer vision techniques for picture localization.
PTQ4SAM: Post-Training Quantization for Segment Anything .	A new framework called PTQ4SAM was created to lessen the memory and processing requirements of the large-scale Segment Anything Model (SAM).
Boosting Smartphone Camera Clarity.	In this study, a self-supervised learning model that enhances reference-based super-resolution (RefSR) is used to present a technique for improving smartphone image resolution.
An Investigation of Incorporating Mamba for Speech Enhancement.	SEMamba is a novel speech enhancement system that enhances voice signal clarity by utilizing the Mamba state-space model.
Florence 2 on WebGPU.	The tiny vision model is fully functional within the onnx and WebGPU-based browser.
FlexiFilm: Long Video Generation with Flexible Conditions.	A diffusion model called FlexiFilm was created expressly to produce long videos—more than 30 seconds—with excellent quality and consistency.

Perspectives

Link	description
Smudgy chins, weird hands, dodgy numbers: seven signs you’re watching a deep fake.	Look out for surplus fingers, compare mannerisms with real recordings and apply good old-fashioned common sense and skepticism, experts advise
Training MoEs at Scale with PyTorch.	To write about scaling their MoE models to thousands of GPUs, the Mosaic team has teamed up with PyTorch.
Investing in the Age of Generative AI.	Though there is currently a "euphoria" surrounding investment, the generative AI business is already showing signs of fragility.
Can AI boom drive Nvidia to a $4tn valuation despite investor doubt?	Powerful new chips are on the way but there are questions over whether tech firm’s growth can be sustained
AI scaling myths.	It is improbable that LLMs will ever be able to achieve AGI through scaling on its own. Although scaling has been found to improve model capabilities, it largely improves confusion instead of emergent skills. Getting hold of high-quality training data is getting harder and harder.
A discussion of discussions on AI bias.	The nature of AI bias has come under more scrutiny, with detractors claiming that biases in machine learning are demonstrated by the way models like Playground AI occasionally change a user's ethnicity in photos. Some users refute this as a flaw or pertinent prejudice, pointing to instances in which Asian traits are overrepresented. The discussion touches on the wider ramifications of AI bias in many businesses. There is no easy answer to this complicated problem.
The shape of information.	This article describes how to use binary logic to maximize scarce resources.
why we no longer use LangChain for building our AI agents.	Octomind's codebase and team productivity increased after it eschewed the LangChain framework for AI test automation in favor of more straightforward, modular building parts. It found that the high-level abstractions of LangChain were rigid, making development and maintenance more difficult. Octomind now benefits from a leaner architecture and faster iteration for its AI agent duties as a result of changing strategy.
The Five Stages Of AI Grief.	Benjamin Bratton, a professor at the University of California, San Diego and director of the Antikythera program at the Berggruen Institute, refers to the global response to artificial intelligence as a "Copernican Trauma," comparing it to historical changes that have reshaped humanity's understanding of itself. Bratton offers the following five stages of "AI grief" to describe how society would react to AI's evolution: from skepticism to integration into our conception of intelligence: denial, rage, bargaining, depression, and acceptance. He contends that rather than being a uniquely human story, the integration of AI represents a larger biological and technological evolutionary process.
How to win at Enterprise AI — A playbook.	This AI-focused playbook describes AI adoption methods for enterprises, emphasizing the move from human-performed services to software-driven workflows known as "Service-as-a-software." It explores how these changes may affect business models, including performance-based pricing, and stresses how crucial workflow capture and AI accuracy are to the implementation process's success. The handbook also covers threats such as lateral attacks and emphasizes that in enterprise contexts, AI must show real performance, not simply potential.
AI is disrupting Customer Support. Salesforce is feeling the pinch.	Customer support software providers like Salesforce and Zendesk are facing challenges as enterprises redirect their IT spending toward AI proof-of-concept projects. For traditional software suppliers, the increasing integration of solutions such as ChatGPT in customer assistance has resulted in longer payback periods due to higher customer acquisition expenses. The creativity of these businesses and the overall macroeconomic climate will determine how much money is invested in customer support software in the future.
Contra Acemoglu on AI.	In contrast to more positive projections, economist Daron Acemoglu's working paper on AI proposes a modest 0.06% annual rise in TFP growth. He identifies four distinct ways that AI affects productivity, but he ignores the development of new labor-intensive goods and the further automation of existing processes, perhaps underestimating the economic potential of AI. His method is criticized for being unduly restrictive and for perhaps distorting the wider socioeconomic effects of AI developments.
Inside the maths that drives AI.	Loss functions measure algorithmic errors in artificial intelligence models, but there’s more than one way to do that. Here’s why the right function is so important.
‘The disruption is already happening!’ Is AI about to ruin your favorite TV show?	It won’t be long till everything from Drag Race to Keeping Up With the Kardashians could be written without humans – and you might be able to write yourself as the hero of a new show. But will robot TV ever be up to snuff?
Can the climate survive the insatiable energy demands of the AI arms race?	New computing infrastructure means big tech is likely to miss emissions targets but they can’t afford to get left behind in a winner takes all market
Our attitudes towards AI reveal how we feel about human intelligence.	We’re in the untenable position of regarding the AI as alien because we’re already in the position of alienating each other

Back to index

ML news: Week 24 - 30 June

Research

Link	description
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?	reports that long-context LLMs can compete with state-of-the-art retrieval and RAG systems without explicit training on the tasks; suggests that compositional reasoning (needed in SQL-like tasks) is still challenging for these LLMs; and encourages further research on advanced prompting strategies. performs a thorough performance analysis of long-context LLMs on in-context retrieval and reasoning. first presents a benchmark with real-world tasks requiring 1M token context.
PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers.	improves decision-making using the iterative plan-then-RAG (PlanRAG) technique, which consists of two steps: The last phase determines whether a new plan for additional analysis is required and repeats earlier steps or makes a decision based on the data. 1) An LM creates the plan for decision-making by reviewing the questions and data schema, and 2) the retriever creates the queries for data analysis; It is discovered that PlanRAG performs better than iterative RAG on the suggested Decision QA tasks.
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs.	demonstrates how the goldfish loss resists memorization and keeps the model useful, but it may need to train for longer to more effectively learn from the training data. It is a modification of the next-token prediction objective called goldfish loss, which helps mitigate the verbatim generation of memorized training data. It uses a simple technique that excludes a pseudorandom subset of training tokens at training time.
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B.	report having used an approach that combines LLMs with Monte Carlo Tree Search to achieve a mathematical Olympiad solution at the GPT-4 level. This approach aims to improve the system's performance in mathematical reasoning by enabling features like systematic exploration, self-refinement, and self-evaluation.
From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries.	aims to better understand how LLMs use external knowledge in place of parametric information when responding to factual queries. It finds that in an RAG pipeline, LLMs take a "shortcut" and exhibit a strong bias toward using only the context information and their parametric memory to answer the question.
Tree Search for Language Model Agents.	reveals that performance scales with increased test-time computing. It is tested on interactive online environments and applied to GPT-4o to dramatically enhance performance. It suggests an inference-time tree search technique for LM agents to explore and enable multi-step reasoning.
Evidence of a log scaling law for political persuasion with large language models.	Super persuasion is the worry that models may become noticeably more persuasive as they get bigger. The idea that larger models aren't significantly more compelling than smaller models isn't supported by strong data. They might, nevertheless, be able to be adjusted to be more convincing.
MacroHFT: Memory Augmented Context-aware Reinforcement Learning On High Frequency Trading.	Reinforcement learning is used in MacroHFT, a novel method of high-frequency trading (HFT) in cryptocurrency markets, to enhance profitability and decision-making.
Soft-QMIX: Integrating Maximum Entropy For Monotonic Value Function Factorization.	Researchers have included a local Q-value learning method within a maximum entropy framework to enhance QMIX, a well-liked multi-agent reinforcement learning technique.
eaL: Efficient RLHF Training for LLMs with Parameter Reallocation.	ReaLHF is a unique method that optimizes parallelization during training and dynamically redistributes parameters to improve reinforcement learning from human input (RLHF).
AlphaFold2 structures guide prospective ligand discovery.	AlphaFold2 (AF2) models have had a wide impact but mixed success in retrospective ligand recognition. We prospectively docked large libraries against unrefined AF2 models of the σ2 and serotonin 2A (5-HT2A) receptors, testing hundreds of new molecules and ...
GPTs are GPTs: Labor market impact potential of LLMs.	OWe proposes a framework for evaluating the potential impacts of large-language models (LLMs) and associated technologies on work by considering their relevance to the tasks workers perform in their jobs. When accounting for current and likely future software developments that complement LLM capabilities, this share jumps to just over 46% of jobs.
Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models.	PE-Rank is a novel passage ranking method that leverages context compression through single passage embeddings to increase performance.
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression.	By customizing sparse attention configurations for each head and layer, the Mixture of Attention (MoA) method maximizes sparse attention in large language models.
GeoMFormer: A General Architecture for Geometric Molecular Representation Learning.	A new Transformer-based model called GeoMFormer learns both equivariant and invariant properties to enhance molecular modeling.
Making my local LLM voice assistant faster and more scalable with RAG.	Researchers classified data, precomputed embeddings, and dynamically generated examples to improve the efficiency and scalability of an LLM voice assistant.
Retrieval Augmented Instruction Tuning for Open NER with Large Language Models.	Using big language models, Retrieval Augmented Instruction Tuning (RA-IT) enhances information extraction.
Data curation via joint example selection further accelerates multimodal learning.	In pre-training, actively choosing the next best batch is a difficult and open problem. This research from DeepMind investigates how to match SOTA for a variety of tasks while using only 10% of FLOPs and hard-mining negative samples.
Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text.	A system called Director3D was created to improve camera trajectory modeling and 3D scene production in the real world. Director3D creates lifelike 3D scenes from text descriptions by using a Multi-view Latent Diffusion Model and a Trajectory Diffusion Transformer.
Prompt Engineering Tool.	An excellent prompting toolset that helps evaluate the effectiveness of various prompts, nearly completely composed of Sonnet 3.5.
Meta Large Language Model Compiler: Foundation Models of Compiler Optimization.	Two language models that can decompile to LLVM IR and compile code to assembly have been made available by Meta. They received additional training after being trained on 546 billion tokens of superior-quality data. They can accomplish 45% round trip disassembly performance and 77% optimized assembling performance.

News

Link	description
Geologists raise concerns over possible censorship and bias in Chinese chatbot.	GeoGPT developed as part of Chinese-funded earth sciences program aimed at researchers in global south
OpenAI acquires Rockset.	Rockset is a robust database that supports both indexing and querying. The startup was acquired by OpenAI in order to enhance its infrastructure for retrieval.
Snapchat AI turns prompts into new lens.	Snapchat’s upcoming on-device AI model could transform your background — and your clothing — in real-time.
HeyGen Raises $60M Series A to Scale Visual Storytelling for Businesses.	HeyGen, an AI video-generating platform, has raised $60 million in Series A funding to improve its studio-quality video creation and localization capabilities quickly and affordably. HeyGen, which just generated $35 million in ARR, strives to democratize visual storytelling for companies of all sizes.
AI candidate running for Parliament in the U.K. says AI can humanize politics.	Voters can talk to AI Steve, whose name will be on the ballot for the U.K.'s general election next month, to ask policy questions or raise concerns.
Anthropic has a fast new AI model — and a clever new way to interact with chatbots .	Claude 3.5 Sonnet is apparently Anthropic’s smartest, fastest, and most personable model yet.
AIs are coming for social networks.	An app called Butterflies puts a new spin on how we interact with AI. With Meta and others making similar moves, social media is about to get a lot weirder.
OpenAI walks back controversial stock sale policies, will treat current and former employees the same.	OpenAI has changed its policies toward secondary share sales to allow current and former employees to participate equally in its annual tender offers, CNBC has learned. All current and former staffers “will have the same sales limit” and be able to participate at the same time, OpenAI said in documents shared with stakeholders.
Report: Amazon developing AI chatbot that would compete with ChatGPT and others.	Amazon is developing its own consumer-focused AI chatbot that would compete with OpenAI’s ChatGPT and could be revealed later this year, according to a report from Business Insider.
Multi is joining OpenAI.	OpenAI continues its purchase binge by purchasing additional desktop-related infrastructure.
Artificial Marketing Intelligence at your fingertips: MarTech startup Ability AI secures $1.1M pre-seed round funding to automate the process.	Ability AI, a martech startup specializing in full-cycle paid marketing automation with the help of autonomous AI agents, announced today that it has raised $1.1 million in pre-seed funding from SMRK VC as a lead investor, with the participation of other funds and angels.
Claude 3.5 suggests AI’s looming ubiquity could be a good thing.	If you don’t like chatbots popping up everywhere, get ready to be peeved. But the latest version of Anthropic shows AI is becoming more useful – and, crucially, affordable
Apple found in breach of EU competition rules.	European Commission finds iPhone maker broke new laws designed to protect smaller competitors against big tech platforms
Etched is building an AI chip that only runs one type of model.	Etched is among the many, many alternative chip companies vying for a seat at the table — but it’s also among the most intriguing.
Stability AI Secures Significant New Investment.	Stability AI was able to obtain a "significant infusion of capital" from both new and existing investors in addition to hiring a new CEO.
Training a 70B model from scratch: open-source tools, evaluation datasets, and learnings.	Earlier this year, we pre-trained and fine-tuned a 70B-parameter model that outperforms GPT-4o zero-shot on a range of reasoning and coding-related benchmarks and datasets. Our fine-tuned model, pre-trained on 2T tokens, roughly matches a fine-tuned Llama 3 70B, which was pre-trained on more than seven times as much data.
OpenAI Pushes Back Voice Mode.	The sophisticated Voice Mode that OpenAI showcased in its Spring Update will go live in alpha form in late July for a limited group of ChatGPT Plus subscribers.
Meta’s AI translation model embraces overlooked languages.	More than 7,000 languages are in use throughout the world, but popular translation tools cannot deal with most of them. A translation model that was tested on under-represented languages takes a key step towards a solution.
Researchers fool university markers with AI-generated exam papers.	University of Reading project poses questions for integrity of coursework and take-home student assignments
YouTube tries convincing record labels to license music for AI song generator.	Video site needs labels’ content to legally train AI song generators.
Evolutionary Scale Raises $142m series A.	A biology startup called Evolutionary Scale has come out of stealth with significant funding. Additionally, it declared the release of ESM 3, its foundation model, a 98B parameter model trained for 10^24 Flops on 771B biological tokens. Using the model, it found a new luminous green protein that is not found in nature.
Waymo One is now open to everyone in San Francisco.	With its driverless cars, Waymo One now makes it possible for anybody in San Francisco to request a ride. After providing tens of thousands of trips per week, the company is expanding. Its all-electric fleet helps it achieve its sustainability goals and boosts the local economy. Waymo claims that its cars are much less likely to be involved in collisions than those driven by humans, citing increased safety.
ChatGPT on your desktop.	Users can now download the ChatGPT desktop software for macOS.
AI will be help rather than hindrance in hitting climate targets, Bill Gates says.	Microsoft co-founder says efficiencies for technology and electricity grids will outweigh energy use by data centers
Snap Lense Studio 5.0.	The GenAI suite, which Snap introduced with Lens Studio 5.0, is a fantastic development and a huge help for creating augmented reality apps.
Instagram Launching An AI Studio.	Instagram's "AI Studio" enables developers to create self-aware AI chatbots. In the US, an early test of it is presently underway.
Dust raises $16m series A.	Dust, one of the first modern-day chaining and agency companies, raised more money after surpassing $1 million in annual revenue.
ElevenLabs launches iOS app that turns ‘any’ text into audio narration with AI.	"ElevenLabs Reader: AI Audio," the company's debut iOS app, enables users to listen on the go by turning text files or web links into audio narration.

Resources

Link	description
Open-Sora 1.2 Report.	a 1.1B parameter model trained on over 30 million data points, this open-source video generation model can produce 16-second 720p videos. It also features an improved diffusion model and video compression network for both temporal and spatial compression, which lowers training costs and improves the controllability of the generations.
LLM101n: Let's build a Storyteller.	An outline for a new course that Andrej Karpathy is working on can be found in a new repository. It entails creating a narrative-capable aligned language model. Code, video lectures, and other learning resources are included in the course.
AutoCodeRover: Autonomous Program Improvement.	AutoCodeRover is a new technology that combines sophisticated code search methods with big language models to automate software enhancements, such as feature additions and problem fixes.
NLUX.	NLUX is a React and JavaScript open-source library for building conversational AI interfaces. It makes it super simple to build web applications powered by Large Language Models (LLMs) and AI. With just a few lines of code, you can add conversational AI capabilities and interact with your favorite AI models.
Claudette.	Claudette is a higher-level and easier-to-use way to interact with Claude.
top CVPR 2024 papers.	Computer Vision and Pattern Recognition is a massive conference. In 2024 alone, 11,532 papers were submitted, and 2,719 were accepted. I created this repository to help you search for crème de la crème of CVPR publications.
TTS in 7000 Languages.	Recently, Toucan published a collection of new text-to-speech models that are now compatible with all ISO-639-3 standard languages.
ParaLLM: 1300+ tok/s on a MacBook.	When batch parallel KV cache is implemented in MLX, inference times for the creation of synthetic data and model completions are significantly sped up.
Train vision models in TRL .	Transformers can be trained using reinforcement learning with the help of TRL, a Hugging Face library. You may apply the same procedure for vision-based language models, such as LLaVA, using this example.
Rethinking Remote Sensing Change Detection With A Mask View.	Two new models for remote sensing change detection—CDMask and CDMaskFormer—are presented in this study.
llama.ttf.	This article explains how to use a font file to run a little Llama language model.
june.	June is a local voice chatbot that combines the power of Ollama (for language model capabilities), Hugging Face Transformers (for speech recognition), and the Coqui TTS Toolkit (for text-to-speech synthesis). It provides a flexible, privacy-focused solution for voice-assisted interactions on your local machine, ensuring that no data is sent to external servers.
Building a personalized code assistant with open-source LLMs using RAG Fine-tuning.	AI and Morph Labs collaborated to create an excellent blog post about optimizing models for retrieval enhanced generation. They also demonstrate a few applications of generated data.
EvalAlign: Evaluating Text-to-Image Models through Precision Alignment of Multimodal Large Models with Supervised Fine-Tuning to Human Annotations.	A novel metric called EvalAlign was created to enhance the assessment of generative models that convert text to images. EvalAlign provides fine-grained accuracy and stability in contrast to current measures. It emphasizes text-image alignment and image faithfulness.
Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models.	Florence-2, released by Microsoft in June 2024, is a foundation vision-language model. This model is very attractive because of its small size (0.2B and 0.7B) and strong performance on a variety of computer vision and vision-language tasks. Florence supports many tasks out of the box: captioning, object detection, OCR, and more.
Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity.	Specifically designed kernels have been created by the PyTorch team to utilize sparse cores, which are typically exclusively used for inference.
FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models.	Diffusion models are used in FreeTraj, a tuning-free technique for controlling motion trajectories in video creation. To direct the generated content, it adjusts the attention mechanisms and noise sampling.
OpenGlass - Open Source Smart Glasses.	Turn any glasses into hackable smart glasses with less than $25 of off-the-shelf components. Record your life, remember people you meet, identify objects, translate text, and more.
An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability.	The Golden Gate Claude served as a potent illustration of how to influence and evaluate models using SAEs. This work includes some sample code for training these models and an easy-to-understand explanation of how it operates.
RES-Q.	A new benchmark called RES-Q is designed to evaluate how well huge language models can modify code repositories using instructions in natural language.
Balancing Old Tricks with New Feats: AI-Powered Conversion From Enzyme to React Testing Library at Slack.	Using a hybrid method, Slack developers used AI Large Language Models with Abstract Syntax Tree transformations to automate the translation of more than 15,000 unit tests from Enzyme to React Testing Library. The team utilized Anthropic's Claude 2.1 AI model in conjunction with DOM tree capture for React components to achieve an 80% success rate in automatic conversions. This ground-breaking project demonstrates Slack's dedication to using AI to improve developer productivity and experience. It's part of the continuous attempts to remain ahead of the always-changing frontend scene.
R2R.	R2R was designed to bridge the gap between local LLM experimentation and scalable, production-ready Retrieval-Augmented Generation (RAG). R2R provides a comprehensive and SOTA RAG system for developers, built around a RESTful API for ease of use.
Internist.ai 7b.	Internist.ai 7b is a medical domain large language model trained by medical doctors to demonstrate the benefits of a physician-in-the-loop approach. The training data was carefully curated by medical doctors to ensure clinical relevance and required quality for clinical practice.
Finding GPT-4’s mistakes with GPT-4.	CriticGPT, a model based on GPT-4, writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF
ALPBench: A Benchmark for Active Learning Pipelines on Tabular Data.	A program called ALPBench was created to standardize active learning query benchmarks.
Introducing AuraSR - An open reproduction of the GigaGAN Upscaler.	FAL recently made AuraSR, a high-resolution picture upscale, open-sourced. Even with repeated applications, it may upscale by 4x with just one forward pass. AuraSR performs admirably with created photos.
Point-SAM: Promptable 3D Segmentation Model for Point Clouds.	Point-SAM, a transformer-based 3D segmentation model, has been introduced by researchers in response to the increasing demand for comprehensive 3D data.
GenIR-Survey.	This survey explores generative information retrieval (GenIR), a novel approach to information retrieval that shifts from conventional search techniques to ones that generate results dynamically.
Gemma 2.	Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.
MatText: Do Language Models Need More than Text & Scale for Materials Modeling?	MatText is a collection of benchmarking tools and datasets intended to assess the effectiveness of language models in the field of materials science.
mamba2.	A quick implementation of Mamba 2

Perspectives

Link	description
The Long View on AI.	AI has the potential to cause tremendous growth rates and technological improvements, according to historical statistics. Society will probably be able to adjust to these rapid changes just as it has in the past.
AI’s Hidden Opportunities: Shawn "swyx" Wang on New Use Cases and Career.	Well-known developer Shawn "swyx" Wang discusses the untapped potential for conventional software professionals wishing to go into artificial intelligence. In particular, examining how to enhance existing tools, use AI to summarization, and more.
Apple Intelligence.	Rather than developing stand-alone AI products, Apple has incorporated generative AI into its core apps, improving services like Mail classification, Safari summaries, and Siri's functioning. This demonstrates the company's focus on user control and privacy.
Apple intelligence and AI maximalism.	Apple has shown a bunch of cool ideas for generative AI, but much more, it is pointing to most of the big questions and proposing a different answer - that LLMs are commodity infrastructure, not platforms or products.
How To Solve LLM Hallucinations.	Lamini has created Memory Tuning, which effectively embeds particular facts into models without sacrificing general knowledge and reduces hallucinations by 95%.
AI machine translation tools must be taught cultural differences too.	But to successfully preserve or revitalize minority languages, the scope of large-language-model (LLM) training needs to be broadened.
Misinformation might sway elections — but not in the way that you think.	Rampant deepfakes and false news are often blamed for swaying votes. Research suggests it’s hard to change people’s political opinions, but easier to nudge their behaviour.
How I’m using AI tools to help universities maximize research impacts.	Artificial intelligence algorithms could identify scientists who need support with translating their work into real-world applications and more. Leaders must step up.
The Future of LLM-Based Agents: Making the Boxes Bigger.	Long-term planning and system-level resilience are two essential strategies that assist move Agents from the playground into the real world, and they are discussed in this post. These introduce the ability to create plans of a higher level for the Agents, allowing for adaptability in the middle of an episode. They also introduce systems techniques to intelligently orchestrate the models, resulting in increased performance and accuracy.
Apple, Microsoft Shrink AI Models to Improve Them.	Large language models are becoming less popular as IT companies shift their focus to more efficient small language models (SLMs). Apple and Microsoft have introduced models with far fewer parameters that nonetheless perform comparably or even better in benchmarks. According to the CEO of OpenAI, we're past the LLM era since SLMs have benefits including greater accessibility for smaller entities, local device operation, and potential insights into human language acquisition. Even though SLMs are narrower in scope, their performance is enhanced by training them on high-quality, or "textbook-quality" data.
Are Tech-Enabled Vertical Roll-Ups the Future or the Past?	The ability to generate excess cash flows through operational efficiencies is a prerequisite for roll-up methods. It's possible that the development of AI offers a new lever that fully unlocks the roll-up strategy. Are rollups for SMBs and verticals the future? Two different perspectives on this issue are presented in this post.

Back to index

ML news: Week 17 - 23 June

Research

Link	description
Discovering Preference Optimization Algorithms with and for Large Language Models.	suggests an algorithm that adaptively combines logistic and exponential losses; this approach eliminates the need for human intervention by prompting an LLM to suggest and implement preference optimization loss functions based on previously assessed performance metrics. It also suggests an LLM-driven objective discovery of state-of-the-art preference optimization.
SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals.	a framework to increase the high-level goal-achieving capabilities of an LLM-based agent; during interaction with the environment, the framework adaptively decomposes a high-level goal into a tree structure of useful subgoals; enhances performance on a variety of tasks, including cooperative, competitive, and deferred feedback environments.
Mixture-of-Agents Enhances Large Language Model Capabilities.	a strategy that beats GPT-4o on AlpacaEval 2.0, MT-Bench, and FLASK by utilizing the combined strengths of several LLMs through a Mixture-of-Agents methodology; layers are constructed with numerous LLM agents, and each agent builds on the outputs of other agents in the previous levels.
Transformers meet Neural Algorithmic Reasoners.	Tokens in the LLM can now cross-attend to node embeddings from a GNN-based neural algorithmic reasoner (NAR) thanks to a new hybrid design; the resulting model, named TransNAR, shows gains in OOD reasoning across algorithmic challenges.
Self-Tuning: Instructing LLMs to Acquire New Knowledge through Self-Teaching Effectively.	increases an LLM's capacity to learn new information from raw documents through self-teaching; the process consists of three steps: 1) a self-teaching component that enhances documents with a series of knowledge-intensive tasks emphasizing comprehension, memorization, and self-reflection; 2) the model is configured to continuously learn using only the new documents, aiding in the thorough acquisition of new knowledge; and 3) the deployed model is used to learn new information from new documents while evaluating its QA skills.
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models.	a framework that gives a multimodal LLM access to a visual sketchpad and drawing tools; it can give a model, such as GPT-4, the ability to create intermediate sketches to reason over complex tasks; over strong base models without sketching, it performs better on many tasks; on all the tasks tested, GPT-4 equipped with SketchPad sets a new state of the art.
Mixture of Memory Experts.	claims to enable scaling to a high number of parameters while keeping the inference cost fixed. It suggests a method to significantly reduce hallucination (10x) by tuning millions of expert adapters (e.g., LoRAs) to learn exact facts and retrieve them from an index at inference time. The memory experts are specialized to ensure faithful and factual accuracy on the data it was turned on.
Multimodal Table Understanding.	presents Table-LLaVa 7B, a multimodal LLM for multimodal table understanding; it produces a large-scale dataset MMTab, comprising table images, instructions, and tasks; it is comparable with GPT-4V and greatly outperforms existing MLLMs on numerous benchmarks.
Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement.	suggests a training-efficient way to extend LLMs to longer context lengths (e.g., 4K -> 256K); it uses a truncated Gaussian to encourage sampling from the middle part of the context during fine-tuning; the approach helps to alleviate the so-called "Lost-in-the-Middle" problem in long-context LLMs. suggests a method to tune an LLM to effectively utilize information from the middle part of the context.
Simple and Effective Masked Diffusion Language Models.	Easy diffusion model to model language. It functions fairly well and generates out-of-order.
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding.	A novel technique that dramatically lowers memory consumption during auto-regressive inference in transformers is called Multi-Layer Key-Value (MLKV) sharing.
Understanding Hallucinations in Diffusion Models through Mode Interpolation.	This study looks into the reasons behind "hallucinations"—images that never were in the training set—that are produced by diffusion-based picture generation models.
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs.	Chain of Preference Optimization (CPO) helps large language models (LLMs) become more adept at logical reasoning. CPO matches the reasoning steps of Chain-of-Thought (CoT) decoding with the optimal routes of ToT by fine-tuning LLMs using search trees from the Tree-of-Thought (ToT) technique.
Language Modeling with Editable External Knowledge.	ERASE is a novel approach to updating language models. Unlike conventional methods that emphasize enhancing retrieval during prediction, ERASE incrementally deletes or rewrites entries in the knowledge base as new documents are incorporated.
Duoduo CLIP: Efficient 3D Understanding with Multi-View Images.	Duoduo CLIP is a 3D representation learning model utilizing multi-view images rather than point-clouds for training and analysis.
CAMixerSR: Only Details Need More "Attention".	CAMixerSR enhances image resolution by intelligently applying convolution to simpler areas and using deformable window attention for intricate textures.
‘Fighting fire with fire’ — using LLMs to combat LLM hallucinations.	The number of errors produced by an LLM can be reduced by grouping its outputs into semantically similar clusters. Remarkably, this task can be performed by a second LLM, and the method’s efficacy can be evaluated by a third. The associate article is here
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.	Microsoft has published a collection of tiny VLMs under an MIT license that performs noticeably better in captioning, bounding, and classification than much larger models.
Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability.	The logit lens approach has been improved by decomposing logit outputs into contributions from different model components. This aids in comprehending the decision-making process of transformer models. This method, which employs "prisms" for residual streams, attention layers, and MLP layers, demonstrates how these components affect predictions and offer insights into the tasks that the gemma-2b model does, such as factual retrieval and arithmetic.
PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers.	Using sophisticated data analysis, decision QA is a new role for LLMs that identifies the optimal decisions.
ChangeViT: Unleashing Plain Vision Transformers for Change Detection.	A methodology called ChangeViT makes use of vision transformers (ViTs) to identify significant environmental changes in remote sensing photos.
LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging.	LayerMerge is a novel technique that simultaneously prunes activation functions and convolution layers to increase neural network efficiency.
Adversarial Attacks on Multimodal Agents.	Vision-enabled language models (VLMs) such as Gemini and GPT-4o enable autonomous agents to perform activities like code editing and buying. This investigation demonstrates how susceptible these agents are to malevolent attacks.
TimeSieve: Extracting Temporal Dynamics through Information Bottlenecks.	A novel model called TimeSieve was created to address typical problems in time series forecasting.

News

Link	description
Apple to ‘Pay’ OpenAI for ChatGPT Through Distribution, Not Cash.	The collaboration between Apple and OpenAI isn't anticipated to bring in a significant amount of money for either company, at least not right away. Apple is not paying OpenAI as part of the agreement because it feels that integrating OpenAI's technology and brand into its products is as valuable as or more valuable than financial compensation. The agreement isn't exclusive; Apple is already talking about providing additional chatbot choices. In the long run, Apple intends to profit from AI by entering into revenue-sharing contracts with AI partners.
AI will make money sooner than you’d think, says Cohere CEO Aidan Gomez.	Enterprise is the pathway to profit, Gomez says, but maybe don’t ask it to do medicine quite yet.
Fake beauty queens charm judges at the Miss AI pageant.	An AI model from Romania named Aiyana Rainbow is a finalist in the first Miss AI pageant, which showcases AI-generated models on social media. The event is a part of "The FanVue World AI Creator Awards," which is organized by FanVue and highlights the talent of AI creators who can create captivating content without having to be the face of the work. The $5,000 prize package for Miss AI will include mentorship and support from the public relations community. At the end of June, the outcomes will be made public.
Elon Musk reconsiders phone project after Apple Intelligence OpenAI integration.	Elon Musk threatened to forbid any Apple devices from being used on the properties of his firms in response to Apple integrating OpenAI ChatGPT on a few of its devices.
Microsoft’s star AI chief peers into OpenAI’s code, highlighting an unusual rivalry.	Primarily, OpenAI was established as a safety net against DeepMind, the AI startup that Google purchased in 2014. However, Mustafa Suleyman, a co-founder of DeepMind, has recently been taking on an unimaginable task: delving into OpenAI's crown jewels, the proprietary algorithms that power foundation models like GPT-4, according to people familiar with the situation. This is due to the fact that Suleyman is currently Microsoft's head of AI initiatives. As part of Microsoft's multibillion-dollar investment in OpenAI, the corporation possesses the intellectual property rights to its software.
Amazon says it’ll spend $230 million on generative AI startups.	Amazon says that it will commit up to $230 million to startups building generative AI-powered applications.
McDonald’s ends AI drive-thru trial as fast-food industry tests automation.	Companies have touted AI as the future of the industry, but technology has also resulted in viral videos of wrong orders
Balance effects of AI with profits tax and green levy, says IMF.	Governments faced with economic upheaval caused by artificial intelligence should consider fiscal policies including taxes on excess profits and a green levy to atone for AI-related carbon emissions, according to the International Monetary Fund.
Introducing Gen-3 Alpha.	Runway has developed a brand-new, incredibly potent video generation model. Many of the current functions on its platform will be powered by it. You can find examples at the given URL.
DeepMind’s new AI generates soundtracks and dialogue for videos.	V2A is an AI system that DeepMind is developing to create synchronized soundtracks for videos. It generates music, sound effects, and dialogue using diffusion models trained on audio, dialogue transcripts, and video clips.
Giant Chips Give Supercomputers a Run for Their Money .	The California-based business Cerebras has proven in molecular dynamics calculations that their second-generation wafer-scale engine outperforms the fastest supercomputer in the world by a large margin. Additionally, it can infer sparse huge language models with no loss of accuracy at one-third of the energy cost of a complete model. The hardware of Cerebras allows for quick memory access and interconnects, which make both accomplishments possible. Cerebras aims to expand the scope of its wafer-scale engine applications to encompass a broader range of issues, such as airflow models surrounding cars and molecular dynamics simulations of biological processes.
Nvidia becomes world’s most valuable company amid AI boom.	Chipmaker dethrones Microsoft and Apple as stock market surge boosts valuation above $3.34tn
The ‘Godfather of AI’ quit Google a year ago. Now he’s emerged out of stealth to back a startup promising to use AI for carbon capture.	Renowned AI researchers Geoff Hinton and Max Welling have gathered a talented team to develop AI systems aimed at advancing material science for carbon capture.
Nvidia Conquers Latest AI Tests.	Nvidia's Hopper architecture-based systems excelled in two recent MLPerf AI benchmark tests, which assess the fine-tuning of large language models and the training of graph neural networks.
Perplexity AI searches for users in Japan, via SoftBank deal.	Perplexity is capitalizing on its strategic partnership with SoftBank to broaden its presence in Japan. As part of this initiative, it is providing a free year of its premium AI-powered search engine, Perplexity Pro. SoftBank's goal is to draw users by offering AI services without creating internal solutions. With a valuation of $1 billion, Perplexity is expanding its funding and investor base, which features prominent tech leaders and venture firms.
Introducing Local III.	The open-source local agent, Open Interpreter, has recently received a significant upgrade. It now has the capability to control the computer seamlessly and operates entirely offline and locally.
Introducing the Property Graph Index: A Powerful New Way to Build Knowledge Graphs with LLMs.	LlamaIndex has launched the Property Graph Index, significantly improving knowledge graph capabilities with enhanced modeling, storage, and querying features. This new index enables flexible graph construction and supports schema-guided, implicit, and free-form entity extraction. It also integrates with vector databases for hybrid searches and offers querying options through keyword expansion, vector similarity, Cypher queries, and custom traversal.
Decagon launches with $35m raised from Accel and a16z.	Decagon is developing human-like AI agents for customer support and has recently secured $30 million in Series A funding from Accel, along with $5 million in seed funding from a16z. Decagon's product manages global support for companies such as Eventbrite, Rippling, Webflow, BILT, and Substack.
London premiere of movie with AI-generated script cancelled after backlash.	Plans to show The Last Screenwriter, whose script is credited to ‘ChatGPT 4.0’, prompted complaints although the film-makers insist the feature is ‘a contribution to the cause’
OpenAI’s former chief scientist is starting a new AI company.	Ilya Sutskever is launching Safe Superintelligence Inc., an AI startup that will prioritize safety over ‘commercial pressures.’
Claude 3.5 Sonnet.	At a fifth of the cost, Claude 3.5 Sonnet outperforms Opus in performance. Plus, it's the greatest vision model out there right now. This demonstrates how much the frontier models have progressed.
Apple researchers add 20 more open-source models to improve text and image AI.	With 20 Core Machine Learning models that Apple has added to the Hugging Face open-source AI repository, the repository now includes a wider selection of public models with improved image classification and depth segmentation. These donations come after Apple earlier in the year released the four OpenELMs to Hugging Face and the Ferret big language model. The action shows Apple's dedication to developing AI capabilities and its growing involvement with the AI research community.
Factory Raises $15M Series A from Sequoia.	Led by Sequoia Capital, Factory has raised $15 million in Series A funding to grow its workforce and improve its Droids software development toolset, which leverages artificial intelligence. Its products are rapidly expanding its customer base and setting new benchmarks on the SWE-bench AI coding benchmark. With Factory, software engineering will be increasingly automated, cutting down on laborious processes and speeding up development cycles.
Optimizing AI Inference at Character.AI.	Twenty percent of Google's search volume, or 20,000 questions per second, are answered by Character AI. It operates this effectively thanks to several advancements.
Apple delays launch of AI-powered features in Europe, blaming EU rules.	Apple says competition rules that require functionality with rival products would compromise privacy and security

Resources

Link	description
Nemotron-4 340B.	offers a reward model to filter data based on many qualities and an instruct model to generate high-quality data; exhibits impressive results on widely-used benchmarks such as MMLU and GSM8K; It competes with GPT-4 in a number of activities, such as scoring highly in multi-turn chat; Together with the base model, a preference data is also made available.
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs.	Determining ways to incorporate search into language model creation is now the Holy Grail of study. This work is quite encouraging as it demonstrates that on math performance, tiny models with search can match considerably more powerful models.
MCTSr: Mathematic as a Blackbox for LLM.	The MCT Self-Refine (MCTSr) algorithm integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to enhance performance in complex mathematical reasoning tasks by leveraging systematic exploration and heuristic self-refine mechanisms. Extensive experiments show that MCTSr significantly improves success rates on Olympiad-level mathematical problems, advancing the application of LLMs in strategic reasoning and decision-making.
VideoGPT.	To improve video understanding, a model called VideoGPT+ combines image and video encoders. While video encoders offer temporal context, image encoders capture finely detailed spatial information.
Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach.	To enhance Scene Graph Generation (SGG) for very-high-resolution satellite imaging (VHR SAI), this research introduces a new dataset and methodology.
LLM.Mojo.	This project is a port of Andrej Karpathy's llm.c to Mojo, currently in beta and subject to changes.
Depth Anything V2.	With the use of artificial data, the new Depth Anything model was trained, and its performance on intricate scenes has significantly increased.
DeepSeek-Coder-V2.	Robust DeepSeek Coder achieves scores of 90+ on HumanEval and matches GPT-4 Turbo on numerous other difficult benchmarks. It is free for business usage and accessible via an API.
HelpSteer2: Open-source dataset for training top-performing reward models.	Along with an excellent paper about training reward models to match model output to human preferences, Nvidia has made available a dataset and procedure.
Differentiable rasterization.	Given a program that produces a vector representation of an image (think SVG), rasterization turns it into a pixel representation (think PNG). Everything ought to be adjustable. This article explains how to write SVG light that is differentiable.
LARS - The LLM & Advanced Referencing Solution.	LARS is an application that enables you to run LLMs (Large Language Models) locally on your device, upload your own documents, and engage in conversations wherein the LLM grounds its responses with your uploaded content.
Beyond the Basics of Retrieval for Augmenting Generation.	The RAGatouille creator delivered a great discussion about COLBERT, some of the open issues, and how to significantly increase RAG performance.
TokenCost.	Tokencost helps calculate the USD cost of using major Large Language Model (LLM) APIs by calculating the estimated cost of prompts and completions.
GaiaNet node.	Install and run your own AI agent service.
Meta Chameleon.	Chameleon is an early fusion model that processes images and text tokens concurrently. The team published the paper a few weeks ago and has now released model checkpoints along with inference code.
OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations.	OGNI-DC is a new framework for depth completion that employs "Optimization-Guided Neural Iterations" (OGNI). This method refines a depth gradient field and incorporates the depth gradients into a depth map.
Subobject-level Image Tokenization.	Subobject tokenization is a novel approach for vision models to interpret images. Rather than dividing images into fixed square patches, this method allows models to analyze images by identifying meaningful segments, such as parts of objects.
Introduction to Granite Code Models.	We introduce the Granite series of decoder-only code models for code generative tasks (e.g., fixing bugs, explaining code, documenting code), trained with code written in 116 programming languages. A comprehensive evaluation of the Granite Code model family on diverse tasks demonstrates that our models consistently reach state-of-the-art performance among available open-source code LLMs.
FireFunction V2: Fireworks Function Calling Model.	Open model that matches GPT4-o on function calling benchmarks trained on top of Llama 3 70B.
Argilla.	For AI developers and subject matter experts who need complete data ownership, high-quality outputs, and overall efficiency, Argilla offers a platform for cooperation.
TroL: Traversal of Layers for Large Language and Vision Models.	Large language and vision models (LLVMs) with sizes of 1.8B, 3.8B, and 7B parameters are part of the new TroL family of efficient LLVMs.
Dot.	A stand-alone open-source program designed to be simple to use for local LLMs, and specifically RAG, to interact with files and documents in a manner similar to Nvidia's Chat with RTX.
WebCanvas: Benchmarking Web Agents in Online Environments.	WebCanvas is a pioneering online evaluation framework designed to address the dynamic nature of web interactions. It provides a realistic assessment of autonomous web agents by utilizing live web environments and emphasizing task completion through the identification of key nodes.
CIFAR-10 Airbench.	A benchmark for image classification is CIFAR-10. In a remarkably short amount of time, this algorithm offers a training setting that yields good performance.
Cost Of Self Hosting Llama-3 8B-Instruct.	Compared to using ChatGPT, self-hosting an LLM such as Llama-3 8B-Instruct can be much more expensive, costing approximately $17 per million tokens, while ChatGPT just costs $1 per million tokens. It is possible to lower the cost of self-hosted hardware to less than $0.01 per million tokens, but it would take about 5.5 years for the initial investment to pay for itself.
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models.	Modern surface normal estimate and depth models are assessed using a new benchmark.
An Empirical Study of Mamba-based Language Models.	The Nvidia research that previously showcased the hybrid basic Mamba model is now available.

Perspectives

Link	description
Computer says yes: how AI is changing our romantic lives.	Artificial intelligence is creating companions who can be our confidants, friends, therapists and even lovers. But are they an answer to loneliness or merely another way for big tech to make money?
Nvidia’s New Sales Booster: The Global Push for National AI Champions.	Governments everywhere are increasing their spending to entice corporations and multinationals to construct new data centers and renovate existing ones so that AI can be developed locally and massive language models can be trained in the original languages using data from their inhabitants. According to Nvidia, these independent AI initiatives should generate over $10 billion in revenue this year. The potential economic effects of generative AI are a source of concern for several governments. For their sensitive data and AI infrastructure, they want sovereign clouds, and US IT companies are happy to construct them for them.
General Intelligence (2024).	What is lacking and what would it take to create a generally intelligent agent? This essay suggests that we will be here in a few years and examines the three concepts required to create an agent. The writer is an OpenAI researcher.
Human neuroscience is entering a new era — it mustn’t forget its human dimension.	The field is taking a leap forward thanks to innovative technologies, such as artificial intelligence. Researchers must improve consent procedures and public involvement.
AI and Euro 2024: VAR is shaking up football — and it’s not going away.	Sports physicist Eric Goff explains how updates to the technology can help referees make the toughest calls.
How cutting-edge computer chips are speeding up the AI revolution.	Engineers are harnessing the powers of graphics processing units (GPUs) and more, with a bevy of tricks to meet the computational demands of artificial intelligence.
Apple’s Intelligent Strategy.	Apple showed off an incredible strategic edge in the AI arms race - but some might have missed that the company hints at using its biggest weakness as a formidable weapon against competitors.
How to Fix “AI’s Original Sin”.	The copyright issues raised by AI models trained on protected content without authorization are discussed in this article. It advises AI developers to adhere to copyright signals, put in place safeguards to stop producing content that violates intellectual property rights and design business plans that guarantee just compensation for content creators. These strategies include retrieval-augmented generation (RAG) and the development of collaborative AI content ecosystems.
Takeaways from OpenAI and Google's May announcements.	With the introduction of sophisticated AI models by OpenAI and Google, real-time multimodal understanding and answers are now possible and enhanced AI assistants and advancements in speech agents are promised. Google's Gemini 1.5 Flash offers a notable reduction in latency and cost, while OpenAI's GPT-4o promises double the speed and half the cost of its predecessor. Both digital behemoths are incorporating AI into their ecosystems, with OpenAI focusing on consumer markets with partnerships and products that could potentially reach up to a billion consumers.
Collection of AI Side Business Money-Making Information.	There are some respectable AI projects on this list that even beginners can work on.
paramount.	Paramount lets your expert agents evaluate AI chats

Back to index

ML news: Week 10 - 16 June

Research

Link	description
Scaling neural machine translation to 200 languages.	Based on a sparsely Gated Mixture of Experts architecture and trained on data using a method designed for low-resource languages, presents a massive multilingual model that uses transfer learning across 200 languages. It evaluates 40K translations and achieves an average 44% improvement in translation quality.
MatMul-free LLMs.	claims that memory consumption can be reduced by more than 10x by using an optimized kernel during inference; suggests an implementation that removes matrix multiplication operations from LLMs while maintaining performance at billion-parameter scales; the performance gap between full precision Transformers and the MatMul-free models narrows as the model size increases.
Buffer of Thoughts .	utilizes a meta-buffer containing high-level thoughts (thought templates) extracted from problem-solving processes to present a thought-augmented reasoning approach that improves the accuracy, efficiency, and robustness of LLM-based reasoning. The relevant thought template is then retrieved and instantiated with task-specific reasoning structures for the thought-augmented reasoning process. It shows SOTA performance on 10 difficult tasks at 12% of the cost of multi-query prompting methods such as Tree-of-Thoughts.
SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales.	supervised finetuning on a dataset containing summaries of the differences between multiple reasoning chains is performed by the training framework to teach LLMs to express more accurate fine-grained confidence estimates and self-reflective rationales. Reinforcement learning is then applied to calibrate confidence estimates, encouraging the LLM to produce accurate, high-confidence predictions and penalizing overconfidence in erroneous outputs.
The Geometry of Categorical and Hierarchical Concepts in Large Language Models.	investigates the geometry of categorical concepts and how the hierarchical relations between them are encoded in LLMs. It discovers that the hierarchical structure is reflected in the representation of complex concepts by polytopes made from direct sums of simplices, while simple categorical concepts are represented as simplices by the LLMs.
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback.	suggests a technique that uses a very small number of demonstrations as feedback to align LLMs to a particular setting; it outperforms few-shot prompting, SFT, and self-play methods on the tested benchmarks and aligns LLM outputs to a user's demonstrated behaviors. Additionally, it can learn fine-grained style and task alignment across domains.
Towards Scalable Automated Alignment of LLMs.	gives a summary of the techniques used to align LLMs and examines the four orientations listed below 1) Inductive bias alignment; 2) Behavior imitation alignment; 3) Model feedback alignment; and 4) Environment feedback alignment
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments.	a novel framework with multiple tasks and contexts for wide-ranging, concurrent, and real-time agent exploration; constructs a generally competent LLM-based agent with the ability to self-evolve and investigates its potential beyond data that hasn't been seen before across tasks and environments.
Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment.	A Synthetic-Domain Alignment (SDA) framework has been developed by researchers to improve test-time adaptation (TTA) techniques. By fine-tuning pretrained models with synthetic data produced by a conditional diffusion model, SDA efficiently aligns source and synthetic domains.
ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization.	Reward-based Noise Optimization (ReNO) is a novel technique to improve Text-to-Image (T2I) models during inference by employing signals from reward models with human preferences to optimize the baseline noise.
YOLO-World: Real-Time Open-Vocabulary Object Detection.	With YOLO-World, researchers have improved the widely used YOLO object detectors and included open-vocabulary detection. This method, which combines large-scale dataset training with vision-language modeling, enables it to swiftly and accurately detect a wide range of objects, even in situations for which it was not designed.
Improved Scene Landmark Detection for Camera Localization.	Using distinctive scene landmarks, researchers have developed a novel, privacy-friendly technique for camera localization. This method, which does not rely on real 3D point clouds for localization, is very accurate and storage-efficient since it makes use of 3D scene landmarks and a CNN-based heatmap.
Proofread: Fixes All Errors with One Tap.	The Gboard team has described how they correct sentence- and paragraph-level problems in written text on the device using SFT on a PaLM2-XS model. They discovered that latency optimizations led to significant gains in utilization.
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model.	Using a new quantization approach, the Snap Research team was able to increase speed while reducing the size of the Stable Diffusion UNet model from 1.72 GB to 219 MB. Although the quantization technique is a little complicated, it shows great promise for generative model execution on consumer hardware.
Introducing Apple’s On-Device and Server Foundation Models.	During WWDC 2024, Apple debuted "Apple Intelligence". Apple Intelligence is an AI system that is built into macOS Sequoia, iOS 18, and iPadOS 18. It has sophisticated generative models for a variety of commonplace activities, like text refinement, picture generation, and notification summary. With an emphasis on user privacy and responsible AI development, this system integrates cloud and on-device capabilities to improve the user experience across all Apple products.
OVMR: Open-Vocabulary Recognition with Multi-Modal References.	OVMR is a novel approach that combines textual descriptions with sample photos to improve open-vocabulary recognition.
Predictive Dynamic Fusion.	The Predictive Dynamic Fusion (PDF) architecture solves stability and reliability problems to improve multimodal learning.
Compute Better Spent: Replacing Dense Layers with Structured Matrices.	The Linear layers are where Transformer computation is primarily done. This approach creates a structured representation with better scaling laws than naive dense layers, using less CPU than muP and Monarch matrices.
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models.	A thorough methodology called CARES is used to assess the reliability of Medical Large Vision Language Models (Med-LVLMs).
Learning to Route Among Specialized Experts for Zero-Shot Generalization.	PHATGOOSE is an approach that dramatically increases an AI's capacity to generalize and learn new tasks without prior exposure by efficiently routing between different specialized language models for each portion of a task.
Diabetic Retinopathy Detection.	A unique framework that enhances the grading of diabetic retinopathy (DR), a condition that can result in visual impairment, has been developed by researchers.
BERTs are Generative In-Context Learners.	In a different universe, BERT models—rather than their decoder-only GPT counterparts—would have been shown to be in-context learners. When that is the case, as this paper investigates, BERTs perform remarkably well in information retrieval but poorly in knowledge acquisition, most likely as a result of the bidirectional attention mechanism.
TextGrad: Automatic "Differentiation" via Text.	The concept of treating a language model that is capable of updating text as a backpropagation system is investigated in this study. The benchmark performance, not computationally matched against baseline models, shows significant increases, according to the researchers.
Improve Mathematical Reasoning in Language Models by Automated Process Supervision.	DeepMind found a great way to extend the labor-intensive process of process oversight that requires human intervention. With robust base models, it was able to automate a significant portion of the procedure, which resulted in significant mathematical reasoning performance on Gemini Pro tuned models.
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation.	For image generation, Llama Gen is an autoregressive model that scales better than diffusion alternatives. By using ImageNet to train a class-conditioned model, its researchers were able to raise the bar for FID.
When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models.	In order to address the efficiency concerns in autoregressive big language models, researchers have looked into combining speculative decoding with linear attention techniques. In order to improve training and performance, this work presents an augmentation strategy for linear attention that is consistent with speculative decoding.
What If We Recaption Billions of Web Images with LLaMA-3?	Using a vision model to caption online scraped photos significantly enhances downstream model performance. This is particularly valid for models like CLIP.
Hearing Anything Anywhere.	This research presents DiffRIR, a new framework that uses a planar scene reconstruction with a limited number of room impulse response (RIR) recordings to recreate the spatial acoustic properties of environments.
Simple and Effective Masked Diffusion Language Models.	By using an efficient training recipe and incorporating a simpler Rao-Blackwellized objective, researchers have shown that masked discrete diffusion models can compete with autoregressive approaches in language modeling.

News

Link	description
First NHS physiotherapy clinic run by AI to start this year.	New platform to provide same-day appointments with digital physiotherapist in effort to cut waiting times
Apple to launch iOS 18 AI features marketed as ‘Apple Intelligence’.	Bloomberg’s Mark Gurman today reports that Apple will launch its upcoming AI initiatives in iOS 18 and other operating systems under the brand name ‘Apple Intelligence’, which is obviously a convenient twist on the ‘AI’ acronym.
Claude’s Character.	Claude is not simply your average, sycophantic AI that nods in agreement with the user. A character version of Constitutional AI has been specifically used to create Claude's personality and character. This essay goes into great detail on how Claude uses post-training to control the kind of output that he typically produces in order to portray this desired character.
Databricks + Tabular.	With the acquisition of Tabular, Databricks has brought together major players from Apache Iceberg and Delta Lake to concentrate on data format interoperability for its lakehouse architecture. With Delta Lake UniForm's compatibility solution at the forefront, the objective is to establish a single, open standard for data interoperability in order to prevent data silos.
How the voices for ChatGPT were chosen.	We worked with industry-leading casting and directing professionals to narrow down over 400 submissions before selecting the 5 voices.
OpenAI and Apple announce partnership to integrate ChatGPT into Apple experiences.	Apple is integrating ChatGPT into experiences within iOS, iPadOS, and macOS, allowing users to access ChatGPT’s capabilities—including image and document understanding—without needing to jump between tools.
Apple Intelligence: every new AI feature coming to the iPhone and Mac.	pple announced “Apple Intelligence” at WWDC 2024, its name for a new suite of AI features for the iPhone, Mac, and more. Starting later this year, Apple is rolling out what it says is a more conversational Siri, custom, AI-generated “Genmoji,” and GPT-4o access that lets Siri turn to OpenAI’s chatbot when it can’t handle what you ask it for.
Asana says its new AI teammates are ready to manage your projects.	With the goal of enhancing productivity and output quality, Asana has introduced "AI teammates" to take care of duties like proactive project detail organization and request triaging. This innovative feature is integrated into the workflow and functions like a human team member while yet being supervised by humans. It was showcased at Asana's Work Innovation Summit.
Apple stock reaches record high after the announcement of new AI features.	Tech giant’s shares climb 7% a day after reveal of artificial intelligence features meant to increase appeal of the iPhone
Elon Musk abruptly withdraws lawsuit against Sam Altman and OpenAI.	Tesla CEO had accused the company of abandoning mission of creating artificial intelligence for the greater good of humanity
Mistral raises €600m series B.	Mistral announced €600M in Series B funding for their first anniversary
Mozilla Builders.	Local AI, which enhances accessibility and privacy by bringing AI models and applications directly onto personal devices, is being embraced by the first Mozilla Builders Accelerator. Tools for developer productivity, locally based AI agents, dynamic user interfaces, fine-tuning adaption, retrieval-augmented creation, and enhanced function calling are some of the key areas of advancement. The initiative's goal is for participants to create an open-source, decentralized AI ecosystem with a focus on user empowerment.
CaseMark Raises $1.7M to Empower Attorneys with AI.	In order to increase the scope of its AI solutions for the legal sector, Gradient Ventures led the pre-seed investment in CaseMark, an AI firm that is transforming legal operations.
OpenAI ex-employees worry about company’s control over their millions of dollars in shares.	With OpenAI’s valuation soaring and an IPO nowhere in sight, the company is giving employees the chance to sell some equity in secondary transactions. Ex-employees sitting on millions of dollars worth of stock worry about OpenAI’s ability to force them to give up their shares, according to sources and internal messages. OpenAI recently circulated a document indicating that ex-employees who work at competitors are not included in the tender offers.
Announcing the Open Release of Stable Diffusion 3 Medium.	Stable Diffusion 3 Medium is Stability AI’s most advanced text-to-image open model yet. The small size of this model makes it perfect for running on consumer PCs and laptops as well as enterprise-tier GPUs.
Shutterstock ImageAI, Powered by Databricks.	Databricks and Shutterstock announced a text-to-image Generative AI model optimized for enterprise use
OpenAI Annualized Revenue Doubles.	OpenAI has more than doubled its annualized revenue to hit $3.4B.
Perplexity was planning revenue-sharing deals with publishers when it came under media fire.	Perplexity, the AI search startup that recently came under fire from Forbes for allegedly misusing its content, was already working on revenue-sharing deals with high-quality publishers.
Microsoft’s Nadella Is Building an AI Empire. OpenAI Was Just the First Step.	After landing the deal that launched his company to the front of the artificial intelligence race, the tech chief is spreading his bets. Will it be enough?
OpenAI adds former NSA chief to its board.	OpenAI said on Thursday that it is adding former NSA head and retired Gen. Paul Nakasone to its board of directors as well as its newly formed Safety and Security Committee. Why it matters: OpenAI is looking to convince skeptics that it is taking sufficient steps to ensure its models are safe as it works toward its goal of superintelligence.
Apple Made Once-Unlikely Deal With Sam Altman to Catch Up in AI.	An OpenAI agreement is due to be announced at the Apple’s developer conference next week.
LLM-Squared .	Sakana AI has found a preference optimization scheme that works better than DPO by using an evolutionary approach. It trained models based on code that was suggested by a language model. It has a few suggested variations with very high performance after about 100 generations.
Gemini 1.5 Pro and 1.5 Flash GA, 1.5 Flash tuning support, higher rate limits, and more API updates.	Updates to the Gemini API and Google AI Studio have been released by Google AI. These include support for model tuning, the stable release of Gemini 1.5, increased API rate limitations, additional JSON schema features, and mobile compatibility. The changes boost the alternatives available to developers more efficiently and more customized large-scale buildings.
AI generated sound effects are here.	A new AI audio model from ElevenLabs can generate a variety of voices, tunes, and sound effects based on text cues. By utilizing Shutterstock's audio library, our partnership helps media professionals create better content by facilitating the quick and scalable production of high-quality audio. ElevenLabs' platform makes it simple for users to create sounds, which streamlines the audio design process.
OpenAI welcomes Sarah Friar (CFO) and Kevin Weil (CPO).	With the appointment of Kevin Weil as CPO and Sarah Friar as CFO, OpenAI has strengthened its leadership team to further its goal of developing AI products and doing research that is useful to developers, businesses, and consumers.
Why the pope has the ears of G7 leaders on the ethics of AI.	Pope Francis is leaning on thinking of Paolo Benanti, a friar adept at explaining how technology can change world
AI used to predict potential new antibiotics in a groundbreaking study.	Scientists used an algorithm to mine ‘the entirety of the microbial diversity’ on Earth, speeding up antibiotic resistance research

Resources

Link	description
Spreadsheet Is All You Need.	Complete GPT-2 style transformer model with all weights, parameters, and connections included in a spreadsheet. It is a tiny model that runs entirely within the rows and columns of a spreadsheet and is based on NanoGPT.
Inspectus.	Inspectus is a versatile visualization tool for large language models. It runs smoothly in Jupyter notebooks via an easy-to-use Python API. Inspectus provides multiple views, offering diverse insights into language model behaviors.
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model.	SpatialRGPT is a powerful vision-language model adept at understanding both 2D and 3D spatial arrangements. It can process any region proposal, such as boxes or masks, and provide answers to complex spatial reasoning questions.
Thread.	Thread is a Jupyter Notebook that combines the experience of OpenAI's code interpreter with the familiar development environment of a Python notebook. With Thread, you can use natural language to generate cells, edit code, ask questions or fix errors all while being able to edit or re-run code as you would in a regular Jupyter Notebook.
How AI Image Models Work.	Since 2022, AI image production has advanced beyond producing images with text explanations. This article illustrates the quick progress and promise of AI in visual creation by explaining how these models hone chaotic inputs to create precise and detailed visuals using a kid's game comparison.
Active Stereo Without Pattern Projector.	Without the need for a hardware pattern projector, researchers have presented a new framework that incorporates active stereo concepts into passive cameras that are commonly used.
GLM-4-9B-Chat.	Excellent model with support for 26 languages, trained on 10T tokens by the Tsinghua KEM group.
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data.	DIRECT-3D is a new text-to-3D generative model that directly generates 3D contents in a single forward pass without optimization.
Together MoA.	Together has presented Mixture of Agents (MoA), a cutting-edge technique that mixes many LLMs for optimal performance, outperforming GPT-4o with an AlpacaEval 2.0 score of 65.1%. MoA employs a tiered architecture in which aggregators in later levels improve the initial answers from different models, improving output quality through cooperation. Even with improved precision, MoA still struggles with latency. Reducing latency and improving model design are two potential future possibilities.
Mistral.rs.	Mistral.rs is a fast LLM inference (Rust-based inference framework) platform supporting inference on a variety of devices, quantization, and easy-to-use application with an Open-AI API compatible HTTP server and Python bindings.
Generalizable Human Gaussians from Single-View Image.	A diffusion-guided framework for building 3D human models from a single image is the Human Gaussian Model (HGM).
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis.	Real-time HDR view synthesis from RAW pictures can be achieved with the LE3D approach. It works especially well for situations set at night.
TORAX.	The Python-Jax differentiable fusion tokamak simulator developed by DeepMind at Google is now publicly available. The simulator supports several very powerful PDEs and has good auto-diff capabilities.
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising.	A novel acceleration approach called AsyncDiff makes it possible to perform parallel processing in diffusion models. By splitting the noise prediction model into several parts and executing them on different devices, it drastically cuts latency without sacrificing quality.
PowerInfer-2: Fast Large Language Model Inference on a Smartphone.	Fast inference on the phone for the special Mistral 47B MoE model.
The AXLearn Library for Deep Learning.	AXLearn is a library built on top of JAX and XLA to support the development of large-scale deep-learning models.
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling.	Samba is a simple yet powerful hybrid model with an unlimited context length. Its architecture is frustratingly simple: Samba = Mamba + MLP + Sliding Window Attention + MLP stacking at the layer level.
DiffusionKit.	Framework and tooling for running diffusion models on Apple's MLX framework.
Splash Attention.	new DeepMind kernel in Jax for Sparse Flash Attention
Hugging Face acquires Agrilla.	Argilla a company specialized in data for preference optimization has been acquired.

Perspectives

Link	description
Building AI products.	Though they can't give exact answers to questions, large language models (LLMs) like ChatGPT are excellent at producing responses that seem correct. In order to improve user experience and enhance functionality while reducing errors, AI in the future will integrate LLMs into specialized tools or embed them into already-existing applications. This will contextualize AI outputs within controllable, specified areas.
Why passwords still matter in the age of AI.	As Apple’s new Passwords app tries to solve our identity crisis, why are we still proving who we are via strings of random characters?
Examining LLM performance on public benchmarks.	Popular LLMs on public benchmarks: how overfit are they? Mistral and Phi are overfitting benchmarks, but GPT, Claude, Gemini, and Llama are not, according to new research from Scale AI SEAL. The scientists assessed public LLMs for overfitting on GSM8k and created a new eval GSM1k.
How to track the economic impact of public investments in AI.	National statistics systems should recognize the researchers whose ideas drive artificial intelligence applications, not just machines and factory outputs.
Maintaining Large-Scale AI Capacity At Meta.	To meet AI demands, Meta is modernizing its data centers throughout the world. For AI training tasks, it intends to scale to 600,000 GPUs. In order to assure minimal disruptions and constant performance while enabling quick infrastructure scalability, this calls for creative maintenance tactics and tools like OpsPlanner.

Back to index

ML news: Week 3 - 9 June

Research

Link	description
Contextual Position Encoding: Learning to Count What's Important.	The general position encoding method can attend to the i-th particular word, noun, or sentence; it improves perplexity on language modeling and coding tasks; it is context-dependent and can represent different levels of position abstraction; it suggests a new position encoding method, CoPE, to enable the position to be conditioned on context by incrementing position only on certain tokens.
Faithful Logical Reasoning via Symbolic Chain-of-Thought.	suggests a way to enhance LLMs' capacity for logical thinking by combining logical rules and symbolic expressions with chain-of-thought (CoT) prompting; this prompting method is known as Symbolic Chain-of-Thought and it is a fully LLM-based framework that consists of the following important steps: converts the context of natural language to symbolic format, 2) creates a step-by-step solution plan based on symbolic logical rules, and 3) employs a verifier to validate the translation and reasoning chain.
Transformers Can Do Arithmetic with the Right Embeddings.	The main problem this work addresses is the inability of transformers to track the exact position of digits; they do this by adding an embedding to each digit that encodes its position relative to the start of the number; these gains also transfer to multi-step reasoning tasks that include sorting and multiplication. achieves 99% accuracy on 100-digit addition problems by training on only 20-digit numbers with a single GPU.
GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning.	blends the reasoning powers of GNNs with the language understanding skills of LLMs in a RAG fashion; the GNN extracts relevant and useful graph information, and the LLM uses the information to answer questions over knowledge graphs (KGQA); GNN-RAG outperforms or matches GPT-4 performance with a 7B tuned LLM, and improves vanilla LLMs on KGQA.
Attention as an RNN.	is based on the parallel prefix scan algorithm, which enables efficient computation of attention's many-to-many RNN output. It achieves comparable performance to Transformers on 38 datasets while being more time and memory-efficient. presents a new attention mechanism that can be trained in parallel (like Transformers) and updated with new tokens requiring constant memory usage for inferences (like RNNs).
Are Long-LLMs A Necessity For Long-Context Tasks?	suggests a reasoning framework to allow short-LLMs to handle long-context tasks by adaptively accessing and utilizing the context based on the tasks presented; it breaks down the long context into short contexts and processes them using a decision-making process. The argument claims that long LLMs are not necessary to solve long-context tasks.
Sparse maximal update parameterization: A holistic approach to sparse training dynamics.	All frontier model labs use muP, a potent tool, to transfer hyperparameters fine-tuned on tiny models to bigger, more costly training runs. This study investigates how to achieve that for sparse models, resulting in significantly better training results and lower computation expenses.
Exploring Color Invariance through Image-Level Ensemble Learning.	To address color bias in computer vision, researchers have created a novel learning technique called Random Color Erasing. By selectively excluding color information from training data, this technique strikes a balance between the significance of color and other parameters, producing models that perform better in challenging situations like industrial and wide-area surveillance.
Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models.	Conifer enhances LLMs' comprehension of intricate instructions by utilizing a progressive learning methodology and a customized dataset.
LLM Merging Competition: Building LLMs Efficiently through Merging.	Sakana AI is sponsoring the LLM Merging challenge at NeurIPS this year.
Tribeca to Screen AI-Generated Short Films Created by OpenAI’s Sora.	Short films generated by artificial intelligence are popping up at more and more film festivals, and the largest event yet is dedicating an entire section to AI-generated movies.
Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning.	A technique called InvariantSelectPR is intended to make Large Multimodal Models (LMMs) more adaptive in domain-specific fields such as healthcare.
TAIA: Large Language Models are Out-of-Distribution Data Learners.	A technique called TrainAllInfAttn improves the performance of big language models in niche markets with little data.
MegActor: Harness the Power of Raw Video for Vivid Portrait Animation	A new model called MegActor uses unprocessed driving videos to create more lifelike portrait animation. It addresses identity leaking and background interference and produces remarkable results with a unique data creation framework and background encoding approaches.
MeshXL: Neural Coordinate Field for Generative 3D Foundation Models.	MeshXL is a new model that generates high-quality 3D meshes.
Position-Guided Prompt Learning for Anomaly Detection in Chest X-Rays.	Position-guided Prompt learning method for Anomaly Detection in Chest X-rays (PPAD). PPAD leverages learnable text prompts and image prompts to minimize the gap between pre-training data and task-specific data. Through position-guided prompts, the model can focus on various regions, simulating the diagnostic process of experts.
Tree Diffusion: Diffusion Models For Code.	Wonderful diffusion paper that diffuses picture code. As part of the diffusion process, it can be directly edited. Although it is sluggish, it can be simply used with search to significantly increase one's capacity for reasoning.
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models.	Expanding upon the Greedy Coordinate Gradient (GCG) approach, researchers have enhanced methods for optimization-based jailbreaking of huge language models.
ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation.	A training-free video interpolation technique for generative video diffusion models has been developed by researchers. This novel method improves frame rates without requiring a lot of training or big datasets and works with different models.
A whole-slide foundation model for digital pathology from real-world data.	Prov-GigaPath, a whole-slide pathology foundation model pre-trained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides. To pretrain Prov-GigaPath, we propose GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. We further demonstrate the potential of Prov-GigaPath on vision–language pretraining for pathology by incorporating the pathology reports. In sum, Prov-GigaPath is an open-weight foundation model that achieves state-of-the-art performance on various digital pathology tasks, demonstrating the importance of real-world data and whole-slide modeling.
DreamMat: High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models.	Using Dream Mat to enhance 3D object texture production is a brilliant idea. Given a 3D model, it employs several traditional graphic methods including Metallic, Roughness, and Albedo to generate a very appealing result.
LlamaCare: A Large Medical Language Model for Enhancing Healthcare Knowledge Sharing.	To solve classification problems in large language models (LLMs), researchers have developed LlamaCare, a refined LLM for medical information, in conjunction with Extended Classification Integration (ECI).
XRec: Large Language Models for Explainable Recommendation.	XRec is a framework independent of models that improves explainable recommender systems by utilizing the language capabilities of huge language models.
MetaMixer Is All You Need.	Using simply convolutions, researchers have created a novel method called FFNification that preserves the query-key-value structure while converting self-attention processes into more effective token mixers.
GrootVL: Tree Topology is All You Need in State Space Model.	By dynamically constructing a tree topology based on spatial correlations and input information, GrootVL is a network that enhances state space models.
ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization.	In order to increase Visual Geo-localization (VG) and boost its performance in applications such as SLAM, augmented reality, and autonomous driving, researchers have created a new two-stage training process.
ReLUs Are Sufficient for Learning Implicit Neural Representations.	A review of the application of ReLU activation functions to implicit neural representations (INRs) learning has been conducted. They countered spectrum bias by introducing basic limitations to ReLU neurons, which were inspired by second-order B-spline wavelets.

News

Link	description
OpenAI Is Restarting Its Robotics Research Group.	The San Francisco-based company has been a pioneer in generative artificial intelligence and is returning to robotics after a three-year break.
AI Overviews: About last week.	In order to improve search results and give users more precise and pertinent information, particularly for complex inquiries, Google created AI Overviews. While there were certain problems, such as incorrect results and misread content, Google has fixed these difficulties with over a dozen technical updates, like improving the identification of absurd questions and reducing the amount of user-generated content in AI Overviews.
Nvidia said to be prepping AI PC chip with Arm and Blackwell cores.	Competition could be heating up in the Windows on Arm space amid talk in the industry that Nvidia is readying a chip pairing next-gen Arm cores with its Blackwell GPU architecture.
Ex-OpenAI board member reveals what led to Sam Altman's brief ousting.	In a recent interview, former OpenAI board member Helen Toner provided fresh information into the circumstances surrounding CEO Sam Altman's November dismissal. It appears that the board was informed via Twitter about the release of ChatGPT. According to Toner, Altman had repeatedly lied to the board. It has been alleged that Altman had been lying about events within the organization for years and hiding facts. The board found it difficult to make decisions as a result of his lies, and they concluded that he wasn't the best person to take the firm to AGI.
AI hardware firm Nvidia unveils next-gen products at Taiwan tech expo.	CEO Jensen Huang tells packed stadium in Taipei ‘next Industrial Revolution has begun’
AMD unveils new AI chips to compete with Nvidia.	AMD has been vying to compete against Nvidia, which currently dominates the lucrative market for AI semiconductors and commands about 80% of its share.
Anthropic’s Claude 3 Opus and tool use are generally available on Vertex AI.	Google Cloud now offers Claude 3 Opus with tool use along with the smaller models as part of its Vertex AI offering.
State Space Duality (Mamba-2).	Mambda is an effective model of state space. A lengthy and comprehensive explanation of the model and its enhancements is included in the second version that its team has issued.
No physics? No problem. AI weather forecasting is already making huge strides.	With AI models like WindBorne's WeatherMesh, which leverages the extensive ERA5 dataset to outperform conventional models while using much less processing power, the weather forecasting industry is transforming.
Amazon’s Project PI AI looks for product defects before they ship.	Project PI combines computer vision and generative AI to catch damaged items and prevent returns.
The Opaque Investment Empire Making OpenAI’s Sam Altman Rich.	One of Silicon Valley's most active and successful individual investors is Sam Altman. At the beginning of this year, his stakes in his investment empire were valued at least $2.8 billion. A large portion of the portfolio is unknown. Readers are guided through Altman's investment knowledge in this article.
Even the Raspberry Pi is getting in on AI.	Raspberry Pi partnered with Hailo to provide an optional AI add-on to its microcomputers.
Using AI to decode dog vocalizations.	Leveraging a human speech model to identify different types of barks. University of Michigan researchers are exploring the possibilities of AI, developing tools that can identify whether a dog’s bark conveys playfulness or aggression.
The future is … sending AI avatars to meetings for us, says Zoom boss.	Eric Yuan suggests technology is five or six years away and will free up time to spend with family
AI researchers build ‘future self’ chatbot to inspire wise life choices.	Scientists at MIT hope talking to 60-year-old self will shift thinking on health, money and work
Cartwheel generates 3D animations from scratch to power up creators.	Animating a 3D character from scratch is generally both laborious and expensive, requiring the use of complex software and motion capture tools.
Mistral launches fine-tuning API.	Mistral has launched customization for its models via its platform and API.
If you aren't seeing AI Overviews in your search results, it's probably thanks to Google.	After receiving heavy criticism since their mid-May public launch, AI Overviews in Google Search have dropped in visibility across search results. Since I/O, the average percentage of queries where AI Overviews appear has dropped from 27 percent to just 11 percent. Despite the reduction, healthcare-related queries are a large percentage of AI results, raising concerns about both accuracy and reliability across Google.
Google optimizes shipping routes.	The mathematical optimization for cargo shipping routes was enhanced by Google's operations research group. They discovered a 13% drop in gasoline expenses and consumption.
BrightEdge Releases Post Google I/O Data on The Impact of AI Overviews.	The main businesses affected by AI Overviews, what generates results, and where Google automatically anticipates and responds to search inquiries are all revealed by new research from BrightEdge Generative Parser.
Nvidia emails: Elon Musk diverting Tesla GPUs to his other companies.	The Tesla CEO is accused of diverting resources from the company again. Elon Musk is yet again being accused of diverting Tesla resources to his other companies. This time, it's high-end H100 GPU clusters from Nvidia.
Securing Research Infrastructure for Advanced AI.	In its description of the security architecture of its AI training supercomputers, OpenAI highlights the use of Azure-based infrastructure and Kubernetes for orchestration to safeguard critical model weights and other assets.
Extracting Concepts from GPT-4.	The team at OpenAI has discovered 16 million interpretable features in GPT-4 including price increases, algebraic rings, and who/what correspondence. This is a great step forward for SAE interpretability at scale. They shared the code in a companion GitHub repository.
Mesop: Gradio Competition.	A rival to the well-liked AI prototyping framework Gradio has been made available by Google. Gradio is more mature than Mesop, which is pure Python and slightly more composable.
Nvidia is now more valuable than Apple at $3.01 trillion.	The AI boom has pushed Nvidia’s market cap high enough to make it the second most valuable company in the world.

Resources

Link	description
An Introduction to Vision-Language Modeling.	we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them.
Aya 23: Open Weight Releases to Further Multilingual Progress.	a family of multilingual language models with up to 23 languages supported; it demonstrates that it can perform better on those particular languages than other large-scale multimodal models by purposefully concentrating on fewer languages and allocating greater capacity to them.
Financial Statement Analysis with Large Language Models	claims that by analyzing trends and financial ratios, LLMs can produce insightful insights; demonstrates that GPT-4 outperforms more specialized models; and develops a profitable trading strategy based on GPT's predictions.
SimPO: Simple Preference Optimization with a Reference-Free Reward.	SimPO demonstrates how it outperforms other methods like DPO and claims to generate the strongest 8B open-source model. It is a more straightforward and efficient method for preference optimization with a reference-free reward; it uses the average log probability of a sequence as an implicit reward (i.e., no reference model required), which makes it more compute and memory efficient.
Experimenting with local alt text generation.	A model that runs in the browser and can provide alt text for web photos automatically has been trained by Mozilla.
Mora: More like Sora for Generalist Video Generation.	Mora is a multi-agent framework designed to facilitate generalist video generation tasks, leveraging a collaborative approach with multiple visual agents. It aims to replicate and extend the capabilities of OpenAI's Sora.
FABRIC: Personalizing Diffusion Models with Iterative Feedback.	FABRIC (Feedback via Attention-Based Reference Image Conditioning) is a technique to incorporate iterative feedback into the generative process of diffusion models based on StableDiffusion.
KL is All You Need.	KL divergence is a quick, affordable, and effective method of measuring a certain type of distance between objects. In both conventional and contemporary AI, it is widely employed. This piece examines the potent idea both mathematically and graphically.
7 Ways AI-Native Companies Can Improve User Retention.	a manual with examples of how businesses like Perplexity, Civit, Lapse, Omnivore, and others are using them to increase retention for founders and product executives.
FineWeb: decanting the web for the finest text data at scale.	The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. Recently, we released 🍷 FineWeb, a new, large-scale (15 trillion tokens, 44TB disk space) dataset for LLM pretraining. FineWeb is derived from 96 CommonCrawl snapshots and produces better-performing LLMs than other open pretraining datasets.
An entirely open-source AI code assistant inside your editor.	Continue enables you to easily create your own coding assistant directly inside Visual Studio Code and JetBrains with open-source LLMs. All this can run entirely on your own laptop or have Ollama deployed on a server to remotely power code completion and chat experiences based on your needs.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark.	A popular benchmark for reasoning tasks is MMLU. It is frequently seen as the gold standard and as something that models overfit. A new, more rigorous, and refined benchmark called MMLU Pro is used to gauge language model reasoning.
Omost.	Omost gives you control over how your images are generated. It comes from the same designer as ControlNet. First, it rewrites the prompts into a collection of illustrative code. After that, it renders the finished image using that. Crucially, you can modify the code either prior to or following generation in order to subtly alter the model's output.
Control-GIC.	A novel generative image compression framework called Control-GIC enables fine-grained bitrate modification while preserving high-quality output.
LLM inference speed of light.	Using the theoretical speed of light modeling as grounding is extremely significant for problems where the amount of computation and memory access is known a priori as it helps assess the quality of implementations and predict the impact of architectural modifications.
Neural Surface Reconstruction.	Without the need for 3D supervision, GenS is an end-to-end generalizable neural surface reconstruction model that performs exceptionally well at reconstructing surfaces from multi-view images.
MatMul-Free LM.	Even at the billion-parameter scale, researchers have managed to remove matrix multiplication (MatMul) from huge language models without sacrificing speed.
stable-audio-open-1.0 .	The weights for Stable Audio, which was trained to produce sound effects on audio samples with permissive licenses, have been released by Stability AI.
CV-VAE: A Compatible Video VAE for Latent Generative Video Models.	With its spatio-temporally compressed latent spaces, CV-VAE is a video VAE that works with current image and video models to efficiently train new ones utilizing pre-trained ones.
Qwen2.	Pretrained and instruction-tuned models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B.Having been trained on data in 27 additional languages besides English and Chinese. Having been trained on data in 27 additional languages besides English and Chinese. State-of-the-art performance in a large number of benchmark evaluations
Dragonfly: A large vision-language model with multi-resolution zoom.	We are also launching two new open-source models Llama-3-8b-Dragonfly-v1 a general-domain model trained on 5.5 million image-instruction pairs and Llama-3-8b-Dragonfly-Med-v1 finetuned on an additional 1.4 biomedical image-instruction data. Dragonfly demonstrates promising performance on vision-language benchmarks like commonsense visual QA and image captioning. Dragonfly-Med outperforms prior models, including Med-Gemini on multiple medical imaging tasks, showcasing its capabilities for high-resolution medical data.
MMLU Pro.	The industry standard for assessing knowledge and reasoning in language models is MMLU.

Perspectives

Link	description
Beyond the Cloud: Distributed AI and On-Device Intelligence.	Transition of AI workflows from cloud to the edge with specialized chip infrastructure & models, multi-modality and ambiance across devices
Sure, Google’s AI overviews could be useful – if you like eating rocks.	The company that shaped the development of search engines is banking on chatbot-style summaries. But so far, its suggestions are pretty wild
AI's Communication Revolution: We're All Talking to Computers Now.	With its real-time integration of text, vision, and audio, OpenAI's GPT-4o is driving a revolution in communication through AI. As a result, human-to-AI communication has become a fundamental form of digital connection and has the potential to bring about substantial societal changes as well as the emergence of new companies focused on AI-centric communication. This transition makes it possible for more natural interactions with AI
A Right to Warn about Advanced Artificial Intelligence.	A group of AI workers, both present and past, is pleading with advanced AI companies to adopt values that guarantee openness and safeguard workers who voice concerns about risks. They emphasize how important it is for businesses to refrain from enforcing non-disparagement agreements, to make anonymous reporting procedures easier, to encourage candid criticism, and to shield whistleblowers from reprisals.
Will Scaling Solve Robotics?	The Conference on Robot Learning, which included 11 workshops and nearly 200 submitted papers, drew over 900 attendees last year. Whether it was possible to tackle robotics problems by training a huge neural network on a large data set was one of the main points of contention throughout the event. To help readers better comprehend the topic, this piece offers the opposing viewpoints. Scaling has been successful in several related domains. It is not feasible, though, because there is a lack of readily available robotics data and no obvious method for obtaining it. Scaling, even if it performs as well as it does in other domains, is probably not going to solve robotics.
Plentiful, high-paying jobs in the age of AI.	Due to comparative advantage, it's feasible that a large number of professions that humans currently perform will be performed by humans eternally, regardless of how much better AIs become at those tasks.
What I learned from looking at 900 most popular open source AI tools.	The goal of this study of open-source AI repositories is to provide readers with a broad overview of the intimidating AI ecosystem.
Meta AI system is a boost to endangered languages — as long as humans aren’t forgotten.	Automated approaches to translation could provide a lifeline to under-resourced languages, but only if companies engage with the people who speak them.
Misinformation poses a bigger threat to democracy than you might think.	In today’s polarized political climate, researchers who combat mistruths have come under attack and been labeled as unelected arbiters of truth. But the fight against misinformation is valid, warranted, and urgently required.
Is AI misinformation influencing elections in India?	A sample of roughly two million WhatsApp messages highlights urgent concerns about the spread and prevalence of AI-generated political content.
I'm Bearish OpenAI.	A shift toward products and a research brain drain should ring your alarm bells
The future of foundation models is closed-source.	If the centralizing forces of data and compute hold, open and closed-source AI cannot both dominate long-term
A Grand Unified Theory of the AI Hype Cycle.	Over the years, the AI sector has experienced multiple hype cycles, each of which produced really useful technology and outlasted the previous one. Instead of following an exponential process, every cycle adheres to a sigmoid one. There is an inevitable limit to any technology development strategy, and it is not too difficult to find. Although this AI hype cycle is unlike any other that has come before it, it will probably go in the same direction.
Hi, AI: Our Thesis on AI Voice Agents.	The current state of AI speech agents is described in a blog post and deck created by Andreessen Horowitz, along with potential areas for advancement and investment. It outlines the present state of the B2B and B2C application layer landscape and covers the top infrastructure stack.

Back to index

ML news: Week 27 May - 2 June

Research

Link	description
Golden Gate Claude.	we released a major new research paper on interpreting large language models, in which we began to map out the inner workings of our AI model, Claude 3 Sonnet. In the “mind” of Claude, we found millions of concepts that activate when the model reads relevant text or sees relevant images, which we call “features”.
A Better Match for Drivers and Riders: Reinforcement Learning at Lyft.	The Lyft team matched drivers and riders using online reinforcement learning, which is rewarded by future profits for the drivers. They made an extra $30 million a year for riders and were able to significantly improve in real-time.
Lessons from the Trenches on Reproducible Evaluation of Language Models.	Language model evaluation is a challenging task, and information on the process is scarce outside of the biggest companies. This work presents a robust and repeatable set of assessment criteria. In the appendix, there is a useful discussion of perplexity evaluation.
RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance.	A novel method for tailoring diffusion models to produce identity-preserving images from user-supplied references is presented by researchers. This strategy steers diffusion models without further training by using classifier guidance, in contrast to classic methods that need considerable domain-specific training.
LoRA-Ensemble: Efficient Uncertainty Modelling for Self-attention Networks.	A parameter-efficient deep ensemble technique for self-attention networks is called LoRA-Ensemble. This method provides accurate and well-calibrated predictions without the significant computational cost associated with typical ensemble methods. It does this by extending Low-Rank Adaptation (LoRA) for implicit ensembling.
Agent Planning with World Knowledge Model.	demonstrates superior performance compared to various strong baselines when adopting open-source LLMs such as Mistral-7B and Gemma-7B. Introduces a parametric world knowledge model to facilitate agent planning. The agent model can self-synthesize knowledge from expert and sampled trajectories; this is used to train the world knowledge model. Prior task knowledge is used to guide global planning and dynamic state knowledge is used to guide local planning.
Enhancing Answer Selection in LLMs.	suggests a hierarchical reasoning aggregation framework to enhance LLMs' reasoning capabilities; the method, known as Aggregation of Reasoning (AoR), chooses answers based on the assessment of reasoning chains; AoR employs dynamic sampling to modify the number of reasoning chains in relation to task complexity; it makes use of evaluation phase results to decide whether to sample more reasoning chains; One well-known problem with majority voting is that it doesn't work when the right option is in the minority; AoR concentrates on assessing the reasoning chains to enhance the choice of the concluding response; AoR can be employed with different LLMs to enhance performance on difficult reasoning problems, and it outperforms several well-known ensemble approaches.
Efficient Inference of LLMs.	suggests a layer-condensed KV cache to achieve effective inference in LLMs; can achieve up to 26x higher throughput than baseline transformers while maintaining satisfactory performance; only computes and caches the key values (KVs) of a small number of layers, which leads to reduced memory consumption and improved inference throughput.
Mapping the Mind of a Large Language Model.	By mapping millions of features that correlate to a wide range of concepts, anthropologists have shown a way to comprehend the inner workings of its huge language model, Claude Sonnet. This interpretability, which permits certain manipulations of these attributes to direct model behaviors, may result in safer AI. The research indicates a noteworthy advancement in comprehending and enhancing the security protocols of artificial intelligence language models.
Object Segmentation in Complex Scenarios.	To enhance Generalized Referring Expression Segmentation (GRES), researchers have developed the Hierarchical Semantic Decoding with Counting Assistance (HDC) framework. As opposed to earlier techniques, HDC combines semantic correspondences and transmits complementing modality information across granularities for improved multi-level decoding.
Label-efficient Semantic Scene Completion with Scribble Annotations.	A novel semantic scene completion method called Scribble2Scene lessens the requirement for thorough labeling.
Semantic and Spatial Adaptive Pixel-level Classifier for Semantic Segmentation.	The constraints of semantic segmentation have been addressed with the introduction of a new Semantic and Spatial Adaptive (SSA) classifier. This novel method makes use of coarse masks to direct prototype adjustment, improving fine-grained recognition and delineating mask boundaries.
RSBuilding: Towards General Remote Sensing Image Building Extraction and Change Detection with Foundation Model.	By integrating building extraction and change detection into a single model, RSBuilding presents a novel method for deciphering buildings from remote sensing photos.
Meteor: Mamba-based traversal of the rationale for Large Language and Vision Models.	This research presents Meteor, a novel massive language and vision model that is efficient and employs several justifications to enhance comprehension and response times.
gzip Predicts Data-dependent Scaling Laws.	Scaling rules are a means of forecasting the performance of models at specific sizes with a given quantity of data. Getting them is costly. In order to forecast a data-dependent scaling law, this research investigates the use of the gzip compression ratio as a powerful signal.
The Road Less Scheduled.	A few weeks prior, a brand-new Meta optimizer was circulating as a possible replacement for Adam. The method, including the part about online updates, is described in more depth in this paper. Overall, this appears like a good outcome, particularly in cases when the complete number of planned training steps is not known at the start of the training process.
Transformers Can Do Arithmetic with the Right Embeddings.	Researchers have added embeddings that encode the position of each digit with respect to the start of the number, which has improved transformer performance on arithmetic tasks.
DMPlug: A Plug-in Method for Solving Inverse Problems with Diffusion Models.	DMPlug is a new plug-in technique that solves inverse problems (IPs) by using pre-trained diffusion models (DMs). DMPlug efficiently addresses both manifold feasibility and measurement feasibility by treating the reverse diffusion process as a function, in contrast to other interleaving techniques.
PatchScaler: An Efficient Patch-independent Diffusion Model for Super-Resolution.	PatchScaler is a diffusion-based technique that greatly improves inference efficiency for single image super-resolution (SR).
Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model.	A revolutionary multimodal big language model called Reason3D was created for a thorough comprehension of 3D environments.
Yuan 2.0-M32: Mixture of Experts with Attention Router.	A Mixture of Expert models with 40B parameters and 3.7B active at all times is Yuan 2.0-M32. Even though it only uses 1/19th of the compute, it performs similarly to Llama 3 70B. It appears remarkably powerful considering its size, having been trained on 2T tokens.
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations.	The Cosine learning rate schedule employed in the original scaling laws publications prevents optimal loss if the Cosine period is not in line with the total number of training steps. Because of this, training enough models to produce useful scaling laws is difficult. In order to minimize GPU costs for scaling law development, this study presents the concept of a constant learning rate with a cool down.
Towards Ultra-High-Definition Image Deraining: A Benchmark and An Efficient Method.	To address the problem of deraining ultra-high-definition (UHD) photographs, researchers have released a new dataset dubbed 4K-Rain13k, which consists of 13,000 pairs of 4K resolution images.
EasyAnimate An End-to-End Solution for High-Resolution and Long Video Generation.	Transformers are used in the EasyAnimate method to modify the DiT architecture for advanced 3D video production. In order to capture temporal dynamics and guarantee seamless motion transitions and consistent frames, this project integrates a motion module block.
Self-Exploring Language Models (SELM).	Online feedback is used in Self-Exploring Language Models (SELM), a technique that improves preference optimization in LLMs.
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback.	When applied to video models, consistency distillation significantly lowers the number of processes required to produce content.

News

Link	description
Meta and Elon Musk’s xAI fight to partner with chatbot group Character.ai .	AI pioneer Noam Shazeer launched Character.ai, a rapidly expanding role-playing startup, and Silicon Valley companies are vying for a partnership. This occurs at a time when numerous big businesses are investing heavily in startups.
Scarlett Johansson told OpenAI not to use her voice — and she’s not happy they might have anyway.	penAI has denied that its ChatGPT voice is based on Johansson, but it certainly sounds a lot like her.
xAI Series B funding round.	xAI is pleased to announce our series B funding round of $6 billion.
iPhone to get a better Siri, AI emoji creator, smart recaps, and more with iOS 18.	in June 2024, the Cupertino giant will finally unveil its approach to AI
New startup builds digital pets for Apple's Vision Pro.	A new startup is coming out of stealth with a plan to offer digital pets for the Apple Vision Pro that use AI to read and respond to human emotion.
Humane is looking for a buyer after the AI Pin’s underwhelming debut.	the startup apparently thinks it’s worth between $750 million and $1 billion despite the deep software flaws and hardware issues of its first product.
OpenAI Board Forms Safety and Security Committee.	OpenAI declared that its new foundation model will be trained, and it established a Safety and Security Committee. As model capabilities advance, this committee will be responsible for recommending to the board what steps should be taken.
Anthropic hires former OpenAI safety lead to head up new team.	Jan Leike, a leading AI researcher who earlier this month resigned from OpenAI before publicly criticizing the company’s approach to AI safety, has joined OpenAI rival Anthropic to lead a new “superalignment” team.
New agent capabilities in Microsoft Copilot.	At Build 2024, Microsoft introduced new Copilot capabilities, like as Copilot Extensions and Connectors for simple customization, Team Copilot for team communication, and bespoke AI Agents to automate operations. The goal of these improvements is to increase efficiency in company processes and productivity. The improvements are anticipated to be widely available later in 2024; they are presently in limited private preview.
“I lost trust”: Why the OpenAI team in charge of safeguarding humanity imploded.	Company insiders explain why safety-conscious employees are leaving.
OpenAI sends internal memo releasing former employees from controversial exit agreements.	OpenAI on Thursday backtracked on a controversial decision to, in effect, make former employees choose between signing a non-disparagement agreement that would never expire, or keeping their vested equity in the company.
Opera adds Google’s Gemini to its browsers.	Users can access Gemini through the Aria AI assistant on Opera browsers.
Two receptors are better than one for AI-designed obesity drugs.	Compounds predicted by machine learning attach to two receptors involved in appetite and weight.
Mistral's New AI Non-Production License .	Mistral is attempting to strike a balance between corporate success and transparency. It has a new license designed to achieve that equilibrium. It will keep releasing further projects under the new MNPL license in addition to Apache 2.0.
Sonic: A Low-Latency Voice Model for Lifelike Speech.	The makers of Mamba, sub quadratic Transformer versions, and SSMs have released a new model. Crucially, their recently founded business, Cartesia, has developed a realistic-sounding, lightning-fast speech-generating technology. This suggests that they want to take up residence in the helpers' area.
Vox Media and The Atlantic sign content deals with OpenAI.	OpenAI continues to establish media partnerships as it looks to lock down training data — and avoid lawsuits.
Mistral's Code Model.	We introduce Codestral, our first-ever code model. Codestral is an open-weight generative AI model explicitly designed for code generation tasks.
OpenAI signs 100K PwC workers to ChatGPT’s enterprise tier as PwC becomes its first resale partner.	OpenAI, on Wednesday announced that it has signed a major enterprise customer that it hopes will indicate how a similar effect could play out in the world of work. PwC, the management consulting giant, will become OpenAI’s biggest customer to date, covering 100,000 users.
Apple's AI plans involve 'black box' for cloud data.	Apple intends to process data from AI applications inside a virtual black box. The concept, known as "Apple Chips in Data Centers" (ACDC) internally, would involve only Apple's hardware being used to perform AI processing in the cloud. The idea is that it will control both the hardware and software on its servers, enabling it to design more secure systems.
Introducing Perplexity Pages.	A new AI-created product for producing shareable, long-lasting research artifacts has been announced by the Perplexity search engine.
Autodesk acquires AI-powered VFX startup Wonder Dynamics.	Autodesk — the 3D tools behemoth — has acquired Wonder Dynamics, a startup that lets creators quickly and easily make complex characters and visual effects using AI-powered image analysis.
Anthropic’s AI now lets you create bots to work for you.	Anthropic is releasing a new feature for its AI chatbot Claude that will let anyone create an email assistant, a bot to purchase shoes or other personalized solutions. It’s called “tool use” (or the nerdier “function calling”), and it hooks up to any external API of your choosing.
Patronus AI Raises $17 million To Detect LLM Mistakes at Scale.	Series A financing led by Glenn Solomon at Notable Capital underscores the urgent need for companies to deploy large language models with confidence
Neuralink rival sets brain-chip record with 4,096 electrodes on human brain.	Brain-computer interface company Precision Neuroscience says that it has set a new world record for the number of neuron-tapping electrodes placed on a living human's brain—4,096, surpassing the previous record of 2,048 set last year, according to an announcement from the company on Tuesday.
Google adds AI-powered features to Chromebook.	Google announced new AI-powered features today for its Chromebook Plus line of devices, such as a writing assistant, a wallpaper creator, and easy access to Google’s Gemini chatbot.
What is science? Tech heavyweights brawl over definition.	AI pioneer Yann LeCun and Elon Musk went head-to-head in a debate about modern research that drew thousands of comments.
Google to refine AI-generated search summaries in response to bizarre results.	After new feature tells people to eat rocks or add glue to pizza sauce, company to restrict which searches return summaries
A new AI service allows viewers to create TV shows. Are we doomed?	Showrunner will let users generate episodes with prompts, which could be an alarming next step or a fleeting novelty

Resources

Link	description
Mistral-finetune.	mistral-finetune is a light-weight codebase that enables memory-efficient and performant finetuning of Mistral's models. It is based on LoRA, a training paradigm where most weights are frozen and only 1-2% additional weights in the form of low-rank matrix perturbations are trained.
Modula.	A novel technique called modular norm allows neural networks to scale training effectively over a range of network sizes by normalizing weight updates.
MobileNet-V4.	Extremely fast and performant computer vision model is called MobileNet. Devices on the edge can run it. This blog article describes the new model and some contemporary modifications that were made to it.
Multi-Dimensional Features.	This project challenges the linear representation hypothesis by examining if language models compute using multi-dimensional characteristics.
llamafile 0.8.6 CPU benchmark.	It is now possible to run inference for the flagship model from Mistral at 20 tokens per second on a commodity CPU, thanks to recent developments from Mozilla's Llamafile project.
Risks and Opportunities of Open-Source Generative AI.	examines the potential and hazards associated with open-source generative AI models and makes the case that these models' overall advantages exceed their drawbacks.
How Far Are We From AGI.	offers a summary of the tactics required to attain artificial general intelligence (AGI), including a thorough survey, discussion, and unique viewpoints. It also addresses significant questions regarding the near future of AGI.
Efficient Multimodal LLMs.	offers a thorough and methodical analysis of the state of efficient multimodal big language models at the moment; it covers applications, constraints, possible future directions, and efficient structures and techniques.
Scientific Applications of LLMs.	introduces INDUS, a full suite of LLMs that comprises small distilled models, an encoder model, and embedding models for Earth science, biology, physics, and planetary sciences, among other subjects.
Guide for Evaluating LLMs.	offers advice and lessons for assessing large language models (LLMs); it also covers best practices and potential problems, and it introduces an open-source framework for LLM evaluation.
Information Retrieval Measures.	It is necessary to understand how effectively the retrieval part is operating in order to build an RAG system. This toolbox provides an extensive range of effective performance metrics for information retrieval.
A Guide to Creating Neural Circuit Diagrams.	This is a guide to drawing Neural Circuit Diagrams by Vincent Abbott from the paper Neural Circuit Diagrams: Robust Diagrams for the Communication, Implementation, and Analysis of Deep Learning Architectures. It allows for deep learning algorithms to be comprehensively expressed using a novel diagrammatic scheme.
InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation.	A novel model called InstructAvatar uses text direction to generate 2D avatars that are emotionally expressive.
Marigold Pipelines for Computer Vision Tasks.	Diffusers can now use one of the best depth models as a pipeline. This tutorial goes over how to utilize the model, what you can do with it, and how to condition the latents of the first frame to make it work with videos effortlessly.
Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20.	A version of LLM C, a solitary self-contained GPT-2 implementation designed to replicate the 2019 model suite, has been made available by Andrej Karpathy. The library can train the simplest of these models in about 90 minutes with this latest release. It has few dependencies and executes from start to finish.
Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth.	An innovative technique for improving makeup transfer tasks without depending on genuine target images is Content-Style Decoupled Makeup Transfer (CSD-MT).
LaVague.	LaVague is an open-source Large Action Model framework to develop AI Web Agents. Our web agents take an objective, such as "Print installation steps for Hugging Face's Diffusers library" and perform the required actions to achieve this goal by leveraging our two core components.
PRISM: A foundation model for life’s chemistry.	Enveda’s PRISM (Pretrained Representations Informed by Spectral Masking) model was trained on 1.2 billion small molecule mass spectra, the largest training set of small molecule mass spectra ever assembled.
Scale Private Leaderboard.	Scale has created a private language model evaluation leaderboard. Although the ordering isn't all that surprising, it's worth noting that the Llama 3 70B frequently outperforms Claude Opus in terms of instruction following.
controlnet-scribble-sdxl-1.0.	Drawing random lines can be used as conditioning data for image creation using Scribble ControlNet. It has strong performance and was trained on a significantly larger number of post-training photos than other ControlNets.

Perspectives

Link	description
AI's Communication Revolution: We're All Talking to Computers Now.	The most recent AI model from OpenAI, GPT-4o, allows for real-time communication between people and machines by adding vision and audio to its text-based capabilities. The AI revolution brings with it a new wave of interactions between humans and AI, and eventually AI itself. These interactions will probably have an impact on our social habits and business structures. The impact of this technology on human communication will develop as it advances, possibly spurring the development of creative businesses and software.
I Don’t Want To Spend My One Wild And Precious Life Dealing With Google’s AI Search.	The unwelcome three-second delay that Google's AI search tool adds to search results is driving users crazy by interfering with their experience and displaying irrelevant content.
LLMs are not suitable for (advanced) brainstorming.	When it comes to truly creative brainstorming, large language models frequently end up producing consensus-based ideas rather than original notions.
Could AI help cure ‘downward spiral’ of human loneliness?.	One computer scientist says we should embrace human-machine relationships, but other experts are more cautious
Scarlett Johansson’s OpenAI clash is just the start of legal wrangles over artificial intelligence.	Hollywood star’s claim ChatGPT update used an imitation of her voice highlights tensions over rapidly accelerating technology
TechScape: What we learned from the global AI summit in South Korea.	One day and six (very long) agreements later, can we call the meeting to hammer out the future of AI regulation a success?
Scarlett Johansson’s OpenAI clash is just the start of legal wrangles over artificial intelligence.	Hollywood star’s claim ChatGPT update used an imitation of her voice highlights tensions over rapidly accelerating technology
Trying to tame AI: Seoul summit flags hurdles to regulation.	UK touts ‘Bletchley effect’ of safety institutes, but division remains over whether to limit AI abilities
How to Build a Category-Defining AI Startup.	AI companies need to adopt a marketing-led strategy in order to stand out from the competition and establish themselves as category leaders as the AI field changes quickly. AI startups may accelerate market acceptance, reshape the industry narrative, and position themselves as visionary leaders in their field by adopting a marketing-led strategy.
Ways to think about AGI.	The consensus is unclear because there isn't a well-developed theoretical model of general intelligence or a clear explanation for why or how LLMs work so well, despite the fact that some experts think AGI may be achievable. The conversation highlights the enormous amount of unanswered questions surrounding AGI, recognizing both its possible advantages and disadvantages while drawing comparisons between theology and the empirical methodology of the Apollo Program.
The AI revolution is coming to robots: how will it change them?.	The melding of artificial intelligence and robotics could catapult both to new heights.
What GPT-4o illustrates about AI Regulation.	This article compares and contrasts model-level, use-level, and conduct-level frameworks in order to analyze several approaches to AI regulation. It contends that use-level regulation, which can lead to unneeded complexity and unworkable constraints for the deployment of AI, is inferior to conduct-level regulation, which applies current laws to new technologies with minimal precision. One example of the drawbacks of a user-level approach is the limitations placed on AI's capacity to infer emotions by the recent EU AI Act.
How does ChatGPT ‘think’? Psychology and neuroscience crack open AI large language models.	Researchers are striving to reverse-engineer artificial intelligence and scan the ‘brains’ of LLMs to see what they are doing, how and why.
Anglo-American bias could make generative AI an invisible intellectual cage.	Studies show that applications in generative artificial intelligence (AI) such as ChatGPT and other large language models perform remarkably well in English, but are not as proficient in other languages. This masks a more insidious problem.
AI won’t eat your job, but it will eat your salary.	AI poses a danger to the skill premium associated with tasks as well as the existence of jobs themselves, which could result in lower compensation for skilled workers. AI has the potential to reorganize job duties and reduce obstacles to task completion, which would result in commoditization and a reduction in the ability to demand a premium wage. Managerial advantages may likewise disappear as AI develops, particularly through AI agents, which would put the human-in-the-loop advantage to the test and further erode skill premiums.
‘All eyes on Rafah’: how AI-generated image swept across social media.	Celebrity posts of graphic following IDF strike help make it among most-shared content of Israel-Gaza war

Back to index

ML news: Week 20 - 26 May

Research

Link	description
LoRA Learns Less and Forgets Less.	LoRA is a well-liked technique for enhancing models to add flair or expertise. The trade-off between forgetting and power while utilizing LoRAs is examined in this research. LoRAs are found to retain more of the initial "out of distribution" performance while learning less than full fine-tuning.
Chameleon: Mixed-Modal Early-Fusion Foundation Models.	Like GPT-4o, Meta has unveiled Chameleon, a natively multi-modal model that works with both text and graphics at the same time. It performs better than a lot of other models. Since then, the Meta team's work on internal models has greatly advanced.
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.	The technical report for Google's most current model family has been updated. While there is a dearth of information regarding the models and data utilized, there is a wealth of information regarding the assessment and safety precautions implemented, providing an intriguing glimpse into large-scale alignment.
Introducing the Frontier Safety Framework.	Frontier Safety Framework was unveiled by Google DeepMind to mitigate the dangers associated with upcoming sophisticated AI models. This framework assesses models against critical capability levels (CCLs) for potentially dangerous AI capabilities and implements mitigation techniques when thresholds are crossed.
ART3D: 3D Gaussian Splatting for Text-Guided Artistic Scenes Generation.	AI can be creatively and entertainingly used to generate artistic 2D visuals. This work uses text-guided Gaussian Splatting to bring that capacity to 3D.
Grounded 3D-LLM with Referent Tokens.	It's difficult to figure out where items are in a 3D setting. You can identify semantic labels for things in 3D space by employing language-guided 3D understanding.
LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation.	LeMeViT is a novel method that uses learnable meta tokens to lower the computational costs associated with Vision Transformers. By effectively capturing important data, these tokens accelerate inference.
Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers.	A fresh security risk has been identified for the well-known AI model Vision Transformers by researchers. The attack, known as SWARM, is extremely sneaky and harmful to consumers since it discreetly activates backdoor behavior in a model using a "switch token".
Mapping the Mind of a Large Language Model.	Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first-ever detailed look inside a modern, production-grade large language model. This interpretability discovery could, in future, help us make AI models safer.
Smart Expert System: Large Language Models as Text Classifiers.	Text classification is a fundamental task in Natural Language Processing (NLP), and the advent of Large Language Models (LLMs) has revolutionized the field. This paper introduces the Smart Expert System, a novel approach that leverages LLMs as text classifiers.
CSTA: CNN-based Spatiotemporal Attention for Video Summarization.	In order to enhance video summarization, this project presents a novel CNN-based SpatioTemporal Attention (CSTA) technique. In contrast to conventional attention processes, CSTA uses a 2D CNN to efficiently extract the visual meaning of frames in order to comprehend relationships and important features in films.
Microsoft introduces Phi-Silica, a 3.3B parameter model made for Copilot+ PC NPUs.	Microsoft is making more investments in the development of small language models (SLMs). At its Build developer conference, the company announced the general availability of its Phi-3 models and previewed Phi-3-vision. However, on the heels of Microsoft’s Copilot+ PC news, it’s introducing an SLM built specifically for these device’s powerful Neural Processing Units (NPUs).
Aurora: A Foundation Model of the Atmosphere.	By training a foundation model for atmospheric predictions, Microsoft has achieved a new state-of-the-art in global weather prediction tests lasting five and ten days.
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark.	A new benchmark called MathBench aims to give a comprehensive evaluation of the mathematical capabilities of large language models.
Wav-KAN: Wavelet Kolmogorov-Arnold Networks.	Wav-KAN is a neural network framework that leverages wavelet functions to enhance performance and interpretability, according to research. Wav-KAN captures both high-frequency and low-frequency data components, which speeds up training and boosts robustness in contrast to standard models.
Global-Local Semantic Consistent Learning (GLSCL).	Global-Local Semantic Consistent Learning (GLSCL), a novel technique created by researchers, greatly lowers computational costs while improving text-video retrieval.
ProtT3: Protein-to-Text Generation for Text-based Protein Understanding.	ProtT3, a novel framework that combines conventional Language Models (LMs) with Protein Language Models (PLMs) to improve text-based protein understanding, is presented by researchers. Using a cross-modal projector known as Q-Former, ProtT3 combines a PLM for analyzing amino acid sequences with a language model to produce high-quality textual descriptions.
Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images.	In order to better explain how the environment changes over time, a new probabilistic diffusion model for Remote Sensing Image Change Captioning (RSICC) is presented in this study.

News

Link	description
First companies sign up to AI safety standards on eve of Seoul summit.	Rishi Sunak says 16 international firms have committed, but standards have been criticized for lacking teeth
World is ill-prepared for breakthroughs in AI, say experts.	Governments have made insufficient regulatory progress, ‘godfathers’ of the technology say before summit
Productivity soars in sectors of the global economy most exposed to AI, says the report.	Employers in the UK, one of 15 countries studied, willing to pay 14% wage premium for jobs requiring AI skills
ChatGPT suspends Scarlett Johansson-like voice as actor speaks out against OpenAI.	OpenAI says ‘Sky’ is not an imitation of actor’s voice after users compare it to AI companion character in film Her
$16k G1 humanoid rises up to smash nuts, twist and twirl.	Humanoid development at Chinese robotics company Unitree continues apace. Following its entry into the melee just last year, its fast-walking H1 bot recently got its backflip groove on. Now the faceless and hand-less humanoid is being joined by an impressive all-rounder.
Google I/O 2024: Here’s everything Google just announced.	Google kicked off its developer conference each year with a rapid-fire stream of announcements, including many unveilings of recent things it’s been working on. Brian already kicked us off by sharing what we are expecting.
Gamma raised $12M in Series A funding to reimagine presentations, powered by AI.	Gamma received $12 million from Accel to use AI to reinvent presentations. Over 18 million people have contributed over 60 million Gammas (AI-generated slides) to date.
Inflection AI reveals new team and plans to embed emotional AI in business bots.	Inflection AI unveiled its new leadership team, composed of seasoned Silicon Valley veterans.
Scarlett Johansson says Altman insinuated that AI soundalike was intentional.	OpenAI has paused a voice mode option for ChatGPT-4o, Sky after backlash accusing the AI company of intentionally ripping off Scarlett Johansson's critically acclaimed voice-acting performance in the 2013 sci-fi film Her.
Perplexity CEO Aravind Srinivas takes shots at Google.	Google's planned roll-out of AI-summarized search results doesn't faze Perplexity AI CEO and co-founder Aravind Srinivas — whose startup has offered a popular AI-driven search tool providing similar digests for nearly two years.
Google still hasn’t fixed Gemini’s biased image generator.	Back in February, Google paused its AI-powered chatbot Gemini’s ability to generate images of people after users complained of historical inaccuracies. Well, the problem’s likely more complex than Hassabis alluded to.
SoundHound AI and Perplexity Partner to Bring Online LLMs to Next Gen Voice Assistants Across Cars and IoT Devices.	Perplexity’s capabilities added to SoundHound Chat AI will respond to questions conversationally with real-time knowledge from the web
Stability AI discusses sale amid cash crunch, The Information reports.	Artificial Intelligence startup Stability AI held discussions with at least one potential buyer in recent weeks about a sale as it faces a cash crunch, The Information reported on Wednesday, citing a person involved in the talks.
Scale AI raises $1B.	Accel and earlier investors provide the gigantic series F. There is a huge need for the services offered, and Scale is in a unique position to keep driving the current AI data surge.
Elon Musk’s xAI is working on making Grok multimodal.	Users may soon be able to input images into Grok for text-based answers.
Google CEO Sundar Pichai on AI-powered search and the future of the web.	The head of Google sat down with Decoder last week to talk about the biggest advancements in AI, the future of Google Search, and the fate of the web.
Apple announces new accessibility features, including Eye Tracking, Music Haptics, and Vocal Shortcuts.	Apple today announced new accessibility features coming later this year, including Eye Tracking, a way for users with physical disabilities to control iPad or iPhone with their eyes.
Microsoft announces $3.3 billion investment in Wisconsin to spur artificial intelligence innovation and economic growth.	Microsoft today announced a broad investment package designed to strengthen the role of Southeast Wisconsin as a hub for AI-powered economic activity, innovation, and job creation. These investments include $3.3B in cloud computing and AI infrastructure, the creation of the country’s first manufacturing-focused AI co-innovation lab, and an AI skilling initiative to equip more than 100,000 of the state’s residents with essential AI skills.
ElevenLabs has launched a free iPhone app that speaks text on the screen — 11 voices and PDF capabilities available.	The unicorn startup ElevenLabs, best known for its AI dubbing site, has launched its first public app.
The US Congress is taking on AI — this computer scientist is helping.	Kiri Wagstaff, who temporarily shelved her academic career to provide advice on federal AI legislation, talks about life inside the halls of power.
OpenAI Partners with News Corp.	News Corp, which publishes articles from WSJ, NYP, The Times, and other publications, and OpenAI have partnered to provide News Corp's news material on OpenAI's platform, which they say would improve generations' accuracy and usability.
Stanford HAI Releases Updated Foundation Model Transparency Index.	The most recent version of Stanford HAI's Foundation Model Transparency Index, which assesses the transparency of 14 significant AI developers, including Google and OpenAI, was released. These businesses showed a considerable improvement and readiness to engage in a dialogue about their models by disclosing fresh information that was not previously known to the public. The average transparency score was just 58 out of 100, indicating serious deficiencies in areas including downstream impact, data access, and model credibility despite these advancements.
The ChatGPT desktop app is more helpful than I expected - here's why and how to try it.	Among OpenAI's many big updates this week was a new ChatGPT app for MacOS. Here's how to use it and when Windows users can get in on the fun.
Suno has raised $125 million to build a future where anyone can make music.	A platform for creating music called Suno has raised $125 million to keep constructing a world in which anyone can compose music.
Nvidia reports stratospheric growth as AI boom shows no sign of stopping.	Chipmaker reports strong demand and higher-than-expected revenue even as other companies spend to develop their own chips
Mistral AI and Harvey Partnership.	Mistral and Harvey, a legal company, have teamed. Although there aren't many specifics in the statement, it's likely that they will collaborate to create a unique legal paradigm.
French AI startup H raises $220M seed round.	H, a startup based in Paris and previously known as Holistic AI, announced a $220 million seed round just a few months after the company’s inception.
Reflections on our Responsible Scaling Policy.	With an emphasis on continuous improvement and cooperation with business and government, Anthropic's Responsible Scaling Policy attempts to prevent catastrophic AI safety failures by identifying high-risk capabilities, testing models often, and enforcing tight safety requirements.
Introducing Aya.	A global initiative led by Cohere For AI involving over 3,000 independent researchers across 119 countries. Aya is a state-of-the-art model and dataset, pushes the boundaries of multilingual AI for 101 languages through open science.
PaliGemma: An Open Multimodal Model by Google.	PaliGemma is a vision language model (VLM) developed and released by Google that has multimodal capabilities. Unlike other VLMs, such as OpenAI’s GPT-4o, Google Gemini, and Anthropic’s Claude 3 which have struggled with object detection and segmentation, PaliGemma has a wide range of abilities, paired with the ability to fine-tune for better performance on specific tasks.
Casper Labs Announces AI Governance Solution, Prove AI .	In an effort to improve enterprise AI applications' auditability and transparency, Casper Labs has launched Prove AI, a joint venture with IBM.
Google AI search tool reportedly tells users to jump off a bridge and eat rocks.	Firm’s AI overviews feature has been rolled out to users in US, but many have reported strange responses

Resources

Link	description
model-explorer.	A new model explorer from Google makes it simple to see the computation graph of your models. Performance engineering and debugging may find use for it.
real-time inference demo for paligemma.	You may run the latest recent Google VLM in real-time by using GPT-Fast. Given how simple it is to fine-tune the model for particular jobs, this opens up a multitude of powerful downstream activities.
Multi AI Agent Systems using OpenAI's Assistants API (Experts.js).	Experts.js is the easiest way to create and deploy OpenAI's Assistants and link them together as Tools to create a Panel of Experts system with expanded memory and attention to detail.
First-ever AI Code Interpreter for R.	Julius is the leading generative AI tool for data analysis. Designed to perform statistical analysis, data science, and computational tasks, it combines cutting-edge foundational models like GPT-4o, Claude 3, and Gemini 1.5 with robust coding capabilities in Python and R.
Moondream WebGPU.	1.86 billion parameter VLM (Vision-Language Model) that is optimized for inference on the web. Once downloaded, the model (1.8 GB) will be cached and reused when you revisit the page. Everything runs directly in your browser using 🤗 Transformers.js and ONNX Runtime Web, meaning your conversations aren't sent to a server. You can even disconnect from the internet after the model has loaded!
Devon: An open-source pair programmer.	You can select different models for Multi-file editing, Codebase exploration, Config writing, Test writing, Bug fixing, and Architecture exploration
llama3 implemented from scratch.	In this file, it implemented llama3 from scratch, one tensor and matrix multiplication at a time. also, going to load tensors directly from the model file that Meta provided for llama3
PSG4D - 4D Panoptic Scene Graph Generation.	The PSG4D (4D Panoptic Scene Graph Generation) Task is a novel task that aims to bridge the gap between raw visual inputs in a dynamic 4D world and high-level visual understanding. It involves generating a comprehensive 4D scene graph from RGB-D video sequences or point cloud video sequences.
microsoft/Phi-3-medium-128k-instruct.	The Phi-3-Medium-128K-Instruct is a 14B parameter, lightweight, state-of-the-art open model trained with the Phi-3 datasets that include both synthetic data and the filtered publicly available websites data with a focus on high-quality and reasoning dense properties.
Debiasing Large Visual Language Models .	Post-Hoc debias method and Visual Debias Decoding strategy. These strategies not only prove beneficial in minimizing hallucinations but also contribute to the generation of more helpful and precise illustrations
DeepSeek-VL.	an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios.
MiniCPM-V.	MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. Models take images and text as inputs and provide high-quality text outputs. Since February 2024, we have released 4 versions of the model, aiming to achieve strong performance and efficient deployment
OLAPH: Improving Factuality in Biomedical Long-form Question Answering.	A new benchmark dataset called MedLFQA was created to enhance the factual accuracy of long-form replies from big language models in the medical domain. OLAPH is a framework that uses preference optimization and automatic evaluations to teach LLMs to reduce errors.
Tarsier.	Tarsier, a new tool from Reworkd, visually tags webpage items with brackets and IDs to improve LLMs for online interface jobs. Through OCR-generated text representations, Tarsier enables an LLM without vision to comprehend the structure of a webpage, beating vision-language models in benchmarks.
mistralai/Mistral-7B-Instruct-v0.3.	The Mistral-7B-Instruct-v0.3 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.3.
Distributed inference on Llama cpp.	Distributed inference across several machines is now supported by Llama Cpp. Although it is now restricted to FP16, this is a significant step toward the deployment of open source.
Enhancing Long-Term Memory for Language Models.	A novel method called Streaming Infinite Retentive LLM (SirLLM) aids large language models in retaining lengthier memory over the course of lengthy conversations.
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering.	Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs.

Perspectives

Link	description
The people charged with making sure AI doesn’t destroy humanity have left the building.	If OpenAI can’t keep its own team together, what hope is there for the rest of the industry? Plus, AI-generated ‘slop’ is taking over the internet
Spam, junk … slop? The latest wave of AI behind the ‘zombie internet’.	Tech experts hope new term for carelessly automated AI webpages and images can illuminate its damaging impact
As the AI world gathers in Seoul, can an accelerating industry balance progress against safety?	Companies such as OpenAI and Meta push ahead, but it is clear that biggest changes are yet to come
What happened to OpenAI’s long-term AI risk team?	Former team members have either resigned or been absorbed into other research groups.
What’s up with Llama 3? Arena data analysis.	When it comes to open-ended creative activities, Meta's Llama 3-70B language model outperforms competitors in the English Chatbot Arena, but it struggles with more technical suggestions. The results of the analysis show that Llama 3's victory rate drops as the instructions get harder and that it excels at friendly, conversational responses. Even if Llama 3's approachability might have helped it succeed, more research is needed to determine its true competitive advantage.
ChatGPT can talk, but OpenAI employees sure can’t.	The stringent non-compete agreement (NDA) of OpenAI, which forbids former workers from criticizing the company for fear of forfeiting their invested ownership, has come to light with the exits of Ilya Sutskever and Jan Leike. In response to the article, CEO Sam Altman said there would be a correction.
AlphaFold3 — why did Nature publish it without its code?	The latest iteration of the protein-structure-prediction algorithm AlphaFold has generated a great deal of interest since its release, accompanied by a paper in Nature, earlier this month. But its release has also prompted questions, and criticism, of both the AlphaFold team at Google DeepMind in London and Nature.
China’s ChatGPT: what a boom in Chinese chatbots means for AI.	ChatGLM is one of hundreds of AI language models being developed for the Chinese language. It comes close to ChatGPT on many measures, say its creators.
The Old-Fashioned Library at the Heart of the A.I. Boom.	OpenAI's remodeled mayonnaise factory headquarters, with its library-themed interior design, is a symbol of the company's success with ChatGPT, which focuses on language. On the other hand, the office reminds people of the current legal disputes around the use of copyrighted content in AI training. The library is seen as a place for inspiration by OpenAI employees, despite these disagreements, which supports their conviction that AI-driven and human creativity can work together harmoniously.
Chaos and tension at OpenAI.	Concerns over OpenAI's dedication to AI safety have led to Ilya Sutskever's departure, which could be concerning given that three other important employees have also quit recently. Concerns are raised regarding how the company's marketing efforts may affect its nonprofit status and safety-focused purpose given these departures. These incidents might also have an impact on the legal and regulatory systems, drawing attention from Washington stakeholders.
AI is the reason interviews are harder now.	This essay addresses how technical interview questions are becoming more complicated and how employers are expecting candidates to answer harder challenges faster. It emphasizes how non-technical users can benefit from using AI technologies like Ultracode to help them pass these kinds of interviews. The article recommends in-person interviews as a way to make sure applicants genuinely have the programming abilities required for the position.
What I've Learned Building Interactive Embedding Visualizations.	An enthusiast for interactive embedding visualizations describes their well-honed process for producing these kinds of visuals, which illustrate the complex relationships between items depicted as points in three-dimensional areas. Data gathering, co-occurrence matrix construction, sparsification, PyMDE embedding, and 2D projection are the steps in the process that provide a clear visual representation. The author advocates for the accessibility and GPU-accelerated rendering capabilities of web apps by using them for the user interface.

Back to index

ML news: Week 13 - 19 May

Research

Link	description
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models.	Separately trained tokenizers are necessary for language models. Tokens that are never encountered during language model training may be produced by these. Even the most potent contemporary language models have a lot. This study investigates this phenomenon and offers solutions for locating and handling these tokens.
Unlearning in Recommender Systems.	With the use of a novel technique called E2URec, huge language model-based recommendation systems may now effectively and efficiently forget user data while maintaining privacy and speed.
Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers.	A project called Lumina seeks to provide a single text-to-X generation mechanism. Its training process involves interleaving text, video, audio, and pictures, which enhances downstream performance.
MatterSim: A Deep Learning Atomistic Model Across Elements, Temperatures, and Pressures.	In AI, simulators can be very effective tools for gathering training data or facilitating interactions between models. A wide range of elemental atomic interactions can be modeled with this simulator.
SGTR+: End-to-end Scene Graph Generation with Transformer.	A new, more effective technique for producing scene graphs has been discovered by researchers. Their transformer-based approach aims to enhance the model's comprehension and interconnection of many parts in a picture, resulting in enhanced performance on complex tasks.
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model.	A vision-language model called InternLM-XComposer2 is very good at producing and comprehending intricate text-image information. It surpasses current approaches in multimodal content production and interpretation by introducing a Partial LoRA technique for a balanced vision and text comprehension.
MambaOut: Do We Really Need Mamba for Vision?	While Mamba is not effective for image classification, it shows promise in detection and segmentation tasks that do. The Mamba architecture is often employed for tasks with long-sequence and autoregressive characteristics. Researchers looked into this design and its application in vision tasks.
State-Free Inference of State-Space Models: The Transfer Function Approach.	For deep learning, a new state-space model with a dual transfer function representation has been created. A state-free sequence parallel inference approach is one of its features.
Learning A Spiking Neural Network for Efficient Image Deraining.	A Spiking Neural Network (SNN) called ESDNet is intended for picture deraining applications. It increases spike signal strength by taking advantage of the special qualities of rain pixel values.
Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning.	Making 3D models is difficult. A coarse mesh can be entered initially, and then the generation process can be carried out, giving users more precise control and higher-quality model output.
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding.	Particularly for Chinese and English, the recently created Hunyuan-DiT establishes a standard for text-to-image diffusion transformers. It has a sophisticated data pipeline and transformer structures for ongoing model enhancement.
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance.	A method to improve the quality of images produced by diffusion models without extra training or external modules is called Perturbed-Attention Guidance (PAG). PAG leads to a significant improvement in the structure and fidelity of both unconditional and conditional samples by innovative manipulation of the self-attention mechanisms within the model.
SqueezeTime.	SqueezeTime is a new lightweight network that enhances temporal analysis by condensing the time axis of movies into the channel dimension, specifically for mobile video understanding.

News

Link	description
OpenAI confirms May 13 event for ‘some ChatGPT and GPT-4 updates’.	Following a report that the company plans to launch a Google Search competitor next week, OpenAI has just confirmed a May 13 event for new “ChatGPT and GPT-4” updates.
Bye-bye bots: Altera’s game-playing AI agents get backing from Eric Schmidt.	Autonomous, AI-based players are coming to a gaming experience near you, and a new startup, Altera, is joining the fray to build this new guard of AI agents.
BLIP3.	Salesforce has trained and released the 3rd non-commercial version of the popular BLIP models, vision, and language models mainly used for image understanding and captioning.
Asterisk/Zvi on California's AI Bill.	Regulations on AI models with a processing capacity of more than 10^26 FLOPs are proposed by the California SB1047 law. By demanding secure surroundings, quick deactivation capabilities, and thorough misuse possibility testing, it focuses on ensuring these models are used securely. The measure aims to address worries about the possible impact of AI on society by balancing innovation with safeguards against exploitation, and it only targets high-risk scenarios.
Bedrock Studio is Amazon’s attempt to simplify generative AI app development.	Amazon is launching a new tool, Bedrock Studio, designed to let organizations experiment with generative AI models, collaborate on those models, and ultimately build generative AI-powered apps.
New GPT-4o AI model is faster and free for all users, OpenAI announces.	Tech company reveals new flagship model that ‘is the future of interaction between ourselves and the machines’
Introducing GPT-4o and more tools to ChatGPT free users.	Today we are introducing our newest model, GPT-4o, and will be rolling out more intelligence and advanced tools to ChatGPT for free.
Open sourcing IBM’s Granite code models.	In order to make coding across several platforms easier and more efficient, IBM is making its Granite code models—which span a range of programming activities and have between 3 and 34 billion parameters—available to the open-source community.
Bloomberg: Apple finalizing a deal with OpenAI to bring ChatGPT features to iOS 18.	Apple is finalizing an agreement with OpenAI to bring some of its technology to the iPhone this year, according to a new report from Bloomberg. With this deal, the report explains that Apple will be able to offer “a popular chatbot” powered by ChatGPT as part of its AI-focused features in iOS 18.
OpenAI says it can now identify images generated by OpenAI — mostly.	The company said its new tool correctly identified 98% of images generated by DALL-E 3
Microsoft is ‘turning everyone into a prompt engineer’ with new Copilot AI features.	Copilot for Microsoft 365 is getting auto-complete, rewrite, and more to improve AI prompts.
Gemini breaks new ground with a faster model, longer context, AI agents, and more.	At I/O 2024, Google unveiled a slew of new features, including Imagen 3, Veo video creation, Gemini Flash, and Project Astra, its newest assistant. Among the many noteworthy enhancements are the 2 million token context duration, significantly reduced model costs, and enhanced multimodality.
Anthropic is expanding to Europe and raising more money.	Anthropic said Monday that Claude, its AI assistant, is now live in Europe with support for “multiple languages,” including French, German, Italian, and Spanish across Claude.ai, its iOS app, and its business plan for teams.
Elon Musk's xAI nears $10 bln deal to rent Oracle's AI servers, The Information reports.	- Elon Musk's artificial intelligence startup xAI has been talking to Oracle (ORCL.N), opens new tab executives about spending $10 billion to rent cloud servers from the company over a period of years, The Information reported on Tuesday, citing a person involved in the talks.
OpenAI co-founder who had a key role in the attempted firing of Sam Altman departs.	Ilya Sutskever helped orchestrate dramatic firing and rehiring of ChatGPT maker’s CEO last year
Google rolls out AI-generated, summarized search results in US.	Tech giant also reveals AI assistant in progress, currently called Project Astra, and AI video generator Veo at annual I/O conference
OpenAI chief scientist Ilya Sutskever is officially leaving.	Ilya Sutskever, OpenAI’s co-founder and chief scientist who helped lead the infamous failed coup against Sam Altman and then later changed his mind, is officially leaving the company.
Project IDX, Google’s next-gen IDE, is now in open beta.	At it’s annual Google I/O 2024 developer conference on Tuesday, Google announced that Project IDX, the company’s next-gen, AI-centric browser-based development environment, is now in open beta. The company first launched it as an invite-only service gated by a waitlist in August.
Researchers build AI-driven sarcasm detector.	Being able to detect the lowest form of wit could help AI interact with people more naturally, say scientists
Hugging Face is sharing $10 million worth of computing to help beat the big AI companies.	ZeroGPU gives everyone the chance to create AI apps without the burden of GPU costs.
OpenAI partners with Reddit to integrate unique user-generated content into ChatGPT.	Reddit, the widely popular social news aggregation and discussion platform, and OpenAI, the renowned AI research laboratory, have announced a strategic partnership that promises to revolutionize the way users interact with online communities and experience AI-powered features.
Meta is reportedly working on camera-equipped AI earphones.	The company believes earphones are the future of AI-wearable technology.
Cursor's instant full file edits with speculative editing.	Using a bespoke Llama 3 70B model with a speculative prior, the researchers were able to rewrite files almost instantly at a rate of 1,000 tokens per second. They achieved this with some creative output formatting and no diffs.
Improvements to data analysis in ChatGPT.	Interact with tables and charts and add files directly from Google Drive and Microsoft OneDrive.

Resources

Link	description
ThunderKittens CUDA DSL.	Hazy research has unveiled a novel DSL for CUDA kernel development. Only 100 lines of code are needed to implement its 30% quicker written flash attention feature.
AnythingLLM.	A full-stack application that enables you to turn any document, resource, or piece of content into a context that any LLM can use as references during chatting. This application allows you to pick and choose which LLM or Vector Database you want to use as well as supporting multi-user management and permissions.
Mirage: A Multi-level Superoptimizer for Tensor Algebra.	Mirage is a tensor algebra superoptimizer that automatically discovers highly optimized tensor programs for DNNs. Mirage automatically identifies and verifies sophisticated optimizations, many of which require joint optimization at the kernel, thread block, and thread levels of the GPU compute hierarchy. For an input DNN, Mirage searches the space of potential tensor programs that are functionally equivalent to the DNN to discover highly optimized candidates. This approach allows Mirage to find new custom kernels that outperform existing expert-designed ones.
audio-diffusion-pytorch.	A fully featured audio diffusion library, for PyTorch. Includes models for unconditional audio generation, text-conditional audio generation, diffusion autoencoding, upsampling, and vocoding. The provided models are waveform-based, however, the U-Net (built using a-UNet), DiffusionModel, diffusion method, and diffusion samplers are both generic to any dimension and highly customizable to work on other formats.
Pipecat.	Pipecat is a framework for building voice (and multimodal) conversational agents. Things like personal coaches, meeting assistants, story-telling toys for kids, customer support bots, intake flows, and snarky social companions.
MRSegmentator: Robust Multi-Modality Segmentation of 40 Classes in MRI and CT Sequences.	A novel tool called MRSegmentator has been developed to improve the segmentation of MRI scans. It can successfully detect 40 distinct organs and structures in the abdominal, pelvic, and thoracic areas.
Time-Evidence-Fusion-Network.	A unique deep learning model called the Time-Evidence Fusion Network (TEFN) is intended to improve long-term time series forecasting. Information fusion and evidence theory are combined, and a specific module is used to increase prediction stability and accuracy.
moondream2-coyo-5M-captions.	5M novel captions based on the alt-text and images of a portion of the COYO dataset.
WebLlama.	We are thrilled to release Llama-3-8B-Web, the most capable agent built with 🦙 Llama 3 and finetuned for web navigation with dialogue.
Ollama on Google Firebase.	For Firebase, Genkit is a new toolkit for developing and implementing generative applications. Open-source language model servers can be launched with it.
Finetune PaliGemma.	This notebook shows how to finetune PaliGemma on a vision-language task. The training data consists of 90 pairs of images and long captions describing them. To make it runnable on a T4 colab runtime with 16GB HBM and 12GB RAM, we opt to only finetune the attention layers of the language model and freeze the other parameters.
Gemini Flash.	Google has released a new lightweight model called Gemini Flash, which has a lengthy context window of up to one million tokens and multimodal reasoning.
DeepMind Veo.	Google Deepmind has released Veo, a new AI model for creating videos that can produce more than one minute in 1080p HD.
IC-Light.	IC-Light is a project to manipulate the illumination of images.
EfficientTrain++.	With ImageNet databases, EfficientTrain++ presents a revolutionary curriculum learning technique that can drastically cut the training periods of popular visual models like ResNet and Swin by up to three times.
NousResearch/Hermes-2-Theta-Llama-3-8B.	Hermes-2 Θ is a merged and then further RLHF'ed version our excellent Hermes 2 Pro model and Meta's Llama-3 Instruct model to form a new model, Hermes-2 Θ, combining the best of both worlds of each model.
Energy-based Hopfield Boosting for Out-of-Distribution Detection.	A method called Hopfield Boosting makes use of contemporary Hopfield energy to improve machine learning models' ability to recognize out-of-distribution (OOD) data.
OpenAI’s custom GPT Store is now open to all for free.	OpenAI is making a number of its previously subscription-only features available to free users of ChatGPT, with the biggest being the ability to browse its GPT Store and use custom bots, said CTO Mira Murati during the company’s Spring update livestream today. The company also published today’s updates in a blog on its website.
llama3.np.	llama3.np is pure NumPy implementation for Llama 3 model. For an accurate implementation, I ran the stories15M model trained by Andrej Karpathy.

Perspectives

Link	description
ChatGPT and the like could free up coders to new heights of creativity.	Far from making programmers an endangered species, AI will release them from the grunt work that stifles innovation
Superhuman?	Top AI labs are focused on achieving artificial general intelligence (AGI), with estimates for its realization ranging from 2027 to 2047. Even though AI hasn't yet reached artificial general intelligence (AGI), certain systems exhibit superhuman abilities in particular tasks, indicating that AI's optimum use right now is as a co-intelligence that complements human efforts rather than replaces them.
Large language models (e.g., ChatGPT) as research assistants.	Artificial intelligence (AI) systems, such as GPT-4, are helping and even surpassing academics in tasks like producing research articles. According to Liang et al., AI is used in up to 18% of publications in some domains. This AI integration may result in a cycle where academic publications are produced and reviewed by software. The effect on scientific advancement is complex, though; while it may allow for more production, there is also a chance that more research will be done during an era in which knowledge will be less.
What OpenAI did.	The integration of voice and vision in GPT-4o's multimodal skills holds great potential for improving AI's ability to interact with the outside world and laying the groundwork for AI to become a more commonplace presence in day-to-day life.
OpenAI’s new GPT-4o model offers promise of improved smartphone assistants.	System can operate directly in speech, speeding up responses and noticing voice quirks, but it still needs the power of Siri
Why mathematics is set to be revolutionized by AI.	Cheap data and the absence of coincidences make maths an ideal testing ground for AI-assisted discovery — but only humans will be able to tell good conjectures from bad ones.
Major AlphaFold upgrade offers boost for drug discovery.	The latest version of the AI models how proteins interact with other molecules — but DeepMind restricts access to the tool.
Lethal AI weapons are here: how can we control them?	Autonomous weapons guided by artificial intelligence are already in use. Researchers, legal experts, and ethicists are struggling with what should be allowed on the battlefield.
AI spending grew 293% last year. Here's how companies are using AI to stay ahead.	According to Ramp's Q1 data, its clients' expenditure on AI has increased by 293% year over year, surpassing the rise of all software investment. AI is also being widely used in non-tech businesses including financial services and healthcare, suggesting a wider integration of AI across a range of industries. Even though there is a general slowdown in new investments in AI, businesses that are already utilizing the technology are doubling down. The average amount spent on AI tools has climbed by 138% year over year, and businesses are still cautious when it comes to travel expenses.
AI Copilots Are Changing How Coding Is Taught.	Professors are shifting away from syntax and emphasizing higher-level skills
Test Driving ChatGPT-4o.	Inspired by ChatGPT vs Math (2023), let’s see how ChatGPT-4o performs.
As the AI world gathers in Seoul, can an accelerating industry balance progress against safety?	Companies such as OpenAI and Meta push ahead, but it is clear that biggest changes are yet to come

Back to index

ML news: Week 6 - 12 May

Research

Link	description
Mantis: Interleaved Multi-Image Instruction Tuning.	A newly developed dataset and trained visual language model that allow for better instruction over a series of images.
FeNNol: an Efficient and Flexible Library for Building Force-field-enhanced Neural Network Potentials.	A state-of-the-art library called FeNNol makes it easier to create and use hybrid neural network potentials in molecular simulations.
Spider: A Unified Framework for Context-dependent Concept Understanding.	Spider is a revolutionary unified paradigm intended to improve comprehension of context-dependent (CD) concepts that rely largely on visual context, like medical lesions and items concealed in the environment.
Frequency-mixed Single-source Domain Generalization for Medical Image Segmentation.	A novel algorithm known as RaffeSDG has been created by researchers to enhance the precision of medical imaging models when evaluating data from various sources.
SlotGAT: Slot-based Message Passing for Heterogeneous Graph Neural Network.	SlotGAT is a new approach that improves heterogeneous graph neural networks by addressing the semantic mixing issue in traditional message passing.
Frequency Masking for Universal Deepfake Detection.	By concentrating on masked picture modeling, particularly in the frequency domain, this novel technique finds deepfakes. The strategy is different from conventional approaches and demonstrates a notable improvement in recognizing artificial images, even from recently developed AI generative techniques.
Auto-Encoding Morph-Tokens for Multimodal LLM.	Researchers have created "Morph-Tokens" to enhance AI's capacity for image creation and visual comprehension. These tokens take advantage of the sophisticated processing capabilities of the MLLM framework to convert abstract notions required for comprehension into intricate graphics for image creation.
Introducing AlphaFold 3.	In a paper published in Nature, we introduce AlphaFold 3, a revolutionary model that can predict the structure and interactions of all life’s molecules with unprecedented accuracy. For the interactions of proteins with other molecule types we see at least a 50% improvement compared with existing prediction methods, and for some important categories of interaction, we have doubled prediction accuracy.
ImageInWords: Unlocking Hyper-Detailed Image Descriptions.	An extraordinarily detailed coupling of images and text was produced via a novel labeling technique that made use of two passes of VLMs. Strong multimodal models can be trained with the help of the captions, which include significantly more detail than any previous dataset.
Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer.	To get beyond memory constraints in the creation of ultra-high-resolution images, a novel diffusion model presents a unidirectional block attention mechanism.
DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks.	A novel model called DocRes handles five tasks in one system: de-warping, deshadowing, appearance enhancement, deblurring, and binarization, making document image restoration easier.
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving.	QoQ is a unique quantization approach that leverages a 4-bit KV cache, 8-bit activations, and 4-bit weights to accelerate big language model inference.
Navigating Chemical Space with Latent Flows.	ChemFlow is a new framework that uses deep generative models to rapidly navigate chemical space, improving molecular science.
Consistency Large Language Models: A Family of Efficient Parallel Decoders.	One intriguing paradigm of ongoing research is the prediction of many tokens at once. If it works, generation times for many large language models would be significantly reduced. This post's method aims to accelerate generation by using a parallel decoding mechanism on fine-tuned LLMs, akin to consistency models from picture synthetics. Initial findings correspond with a 3x speculative decoding performance.
You Only Cache Once: Decoder-Decoder Architectures for Language Models.	The decoder-decoder YOCO architecture maintains global attention capabilities while using less GPU RAM. It is made up of a cross-decoder and a self-decoder, which enable effective key-value pair caching and reuse. With notable gains in throughput, latency, and inference memory over standard Transformers, YOCO performs favorably and is appropriate for big language models and extended context lengths.
Optimal Group Fair Classifiers from Linear Post-Processing.	This innovative post-processing approach ensures compliance with many group fairness criteria, including statistical parity, equal opportunity, and equalized odds, by recalibrating output scores after imposing a "fairness cost" to address model bias.
DiffMatch: Visual-Language Guidance Makes Better Semi-supervised Change Detector.	DiffMatch is a new semi-supervised change detection technique that generates pseudo labels for unlabeled data by using visual language models, hence offering extra supervision signals.
Gemma-10M Technical Overview.	Language-Vision The ability of models to comprehend and interact with text and visuals is quickly developing, as demonstrated by GPT-4V. Their important limits in visual deductive thinking are revealed by a recent study. Using challenging visual puzzles similar to those in IQ testing, researchers assessed these models and found that they had trouble with multi-step reasoning and abstract pattern recognition.
Vision Mamba: A Comprehensive Survey and Taxonomy.	a thorough examination of Mamba's uses in a range of visual tasks and its changing significance. Keep up with the latest discoveries and developments about the Mamba project.

News

Link	description
Lamini Raises $25M For Enterprises To Develop Top LLMs In-House.	Software teams within enterprises can now create new LLM capabilities that lessen hallucinations on proprietary data, run their LLMs securely from cloud VPCs to on-premise and scale their infrastructure with model evaluations that put ROI and business outcomes ahead of hype thanks to Lamini, an Enterprise AI platform. Amplify Partners led a $25 million Series A financing round.
Microsoft-backed OpenAI may launch the search, taking on Google's 'biggest product'.	Speculations in the tech world suggest that OpenAI is gearing up for a major announcement, possibly a new search engine. According to Jimmy Apples, who reports the claim as an insider, the company is planning an event this month (May), tentatively scheduled for May 9, 2024, at 10 am.
An AI-controlled fighter jet took the Air Force leader for a historic ride. What that means for war.	AI marks one of the biggest advances in military aviation since the introduction of stealth in the early 1990s, and the Air Force has aggressively leaned in. Even though the technology is not fully developed, the service is planning for an AI-enabled fleet of more than 1,000 unmanned warplanes, the first of them operating by 2028.
Stack Overflow and OpenAI Partner to Strengthen the World’s Most Popular Large Language Models.	ack Overflow and OpenAI today announced a new API partnership that will empower developers with the collective strengths of the world’s leading knowledge platform for highly technical content with the world’s most popular LLM models for AI development.
Elon Musk’s Plan For AI News.	Musk emails with details on AI-powered news inside X. An AI bot will summarize news and commentary, sometimes looking through tens of thousands of posts per story.
Microsoft says it did a lot for responsible AI in the inaugural transparency report.	The report covers its responsible AI achievements in 2023 but doesn’t talk about Mario flying a plane to the Twin Towers.
Cohere’s Command R Model Family is Now Available In Amazon Bedrock.	Command R Model Family is now available in Amazon Bedrock.
Fake Monet and Renoir on eBay among 40 counterfeits identified using AI.	Paintings identified as fake using cutting-edge technology are ‘tip of the iceberg’ specialist Dr Carina Popovici says
‘A chilling prospect’: should we be scared of AI contestants on reality shows?	Netflix’s hit show The Circle recently introduced an AI chatbot contestant, a potentially worrying sign of where we’re heading
‘ChatGPT for CRISPR’ creates new gene-editing tools.	In the never-ending quest to discover previously unknown CRISPR gene-editing systems, researchers have scoured microbes in everything from hot springs and peat bogs to poo and even yogurt. Now, thanks to advances in generative artificial intelligence (AI), they might be able to design these systems with the push of a button.
Microsoft Working on ‘Far Larger’ In-House AI Model.	Microsoft is reportedly working on a new, in-house artificial intelligence (AI) model that is “far larger” than the other open source models it has trained.
Apple unveils M4: Its first chip made for AI from the ground up.	Apple on Tuesday unveiled M4, the next generation of its Apple Silicon chip. Built with the 3-nanometer chip architecture, M4 is the first Apple chip to be built for AI from the ground up. M4 is the chip that powers the new generation iPad Pro and will soon be inside Macs
OpenAI Model Spec.	This is the first draft of the Model Spec, a document that specifies desired behavior for our models in the OpenAI API and ChatGPT. It includes a set of core objectives, as well as guidance on how to deal with conflicting objectives or instructions.
AI engineers report burnout and rushed rollouts as ‘rat race’ to stay competitive hits tech industry.	Artificial intelligence engineers at top tech companies told CNBC that the pressure to roll out AI tools at breakneck speed has come to define their jobs. They say that much of their work is assigned to appease investors rather than to solve problems for end users and that they are often chasing OpenAI. Burnout is an increasingly common theme as AI workers say their employers are pursuing projects without regard for the technology’s effect on climate change, surveillance, and other potential real-world harms.
The teens making friends with AI chatbots.	Teens are opening up to AI chatbots as a way to explore friendship. But sometimes, the AI’s advice can go too far.
GPT-2-Chatbot Confirmed As OpenAI.	Recently, the gpt-2-chatbot has been seen in the LMSYS space; after discovering information from OpenAI's API through a 429 rate limit issue, it was verified that this was a new model from OpenAI.
OpenAI Is Readying a Search Product to Rival Google, Perplexity.	The feature would let ChatGPT users search the web and cite sources in its results.
DatologyAI raises $46M Series A.	The data curation platform raises additional funds in its September $11 million seed round with the goal of growing its workforce and advancing corporate development.
Yellow raises $5M from A16z for Gen AI-powered 3D modeling tool.	Yellow has raised $5 million in seed funding from A16z Games to fund further development of its Gen AI-powered 3D modeling tool. With its YellowSculpt tool, artists can generate clean, pre-rigged 3D character meshes based on a text prompt in under three minutes.
Stable Artisan: Media Generation and Editing on Discord.	Stable Artisan enables media generation on Discord powered by Stability AI’s cutting-edge image and video models, Stable Diffusion 3, Stable Video Diffusion, and Stable Image Core. In addition to media generation, Stable Artisan offers tools to edit your creations like Search and Replace, Remove Background, Creative Upscale, and Outpainting.
ElevenLabs previews music-generating AI model.	Voice AI startup ElevenLabs is offering an early look at a new model that turns a prompt into song lyrics. To raise awareness, it’s following a similar playbook Sam Altman used when OpenAI introduced Sora, its video-generating AI, soliciting ideas on social media and turning them into lyrics.
Sources: Mistral AI raising at a $6B valuation, SoftBank ‘not in’ but DST is.	Paris-based Mistral AI, a startup working on open source large language models — the building block for generative AI services — has been raising money at a $6 billion valuation, three times its valuation in December, to compete more keenly against the likes of OpenAI and Anthropic, TechCrunch has learned from multiple sources.
Leaked Deck Reveals How OpenAI Is Pitching Publisher Partnerships.	The generative artificial intelligence firm OpenAI has been pitching partnership opportunities to news publishers through an initiative called the Preferred Publishers Program, according to a deck obtained by ADWEEK and interviews with four industry executives.
Alibaba rolls out the latest version of its large language model to meet robust AI demand.	Alibaba Cloud on Thursday said its large language model has seen more than 90,000 deployments in companies across industries. Alibaba Cloud said the latest version of its Tongyi Qianwen model, Qwen2.5, possesses “remarkable advancements in reasoning, code comprehension, and textual understanding compared to its predecessor Qwen2.0.”

Resources

Link	description
Prometheus-Eval.	GPT-4 is a widely used performance benchmark for evaluating generation quality. Built upon Mistral, Prometheus is a model that excels at this particular purpose.
Bonito.	Bonito is an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. This repo is a lightweight library for Bonito to easily create synthetic datasets built on top of the Hugging Face transformers and vllm libraries.
Penzai.	Penzai is a JAX library that provides clear, useful Pytree structures for training and interpreting models. It comes with a wide range of tools for component analysis, debugging, and model visualization. Penzai is easy to install and use, and it offers comprehensive tutorials for learning how to create and interact with neural networks.
Realtime Video Stream Analysis with Computer Vision.	This in-depth article shows you how to create a system that generates reports on the density of vehicle traffic. It counts cars over time using state-of-the-art computer vision.
DOCCI - Descriptions of Connected and Contrasting Images.	A great new dataset from Google that contains detailed and comprehensive labels.
Unsloth.ai: Easily finetune & train LLMs.	An animation by Unsloth's founder demonstrating how the team builds kernels, designs API surfaces, and utilizes PyTorch. The framework and library of Unsloth are incredibly robust and user-friendly.
LeRobot.	LeRobot aims to provide models, datasets, and tools for real-world robotics in PyTorch. The goal is to lower the barrier to entry to robotics so that everyone can contribute and benefit from sharing datasets and pre-trained models. LeRobot contains state-of-the-art approaches that have been shown to transfer to the real-world with a focus on imitation learning and reinforcement learning.
Vibe-Eval.	A benchmark for evaluating multimodal chat models, including especially challenging examples.
DeepSeek-V2-Chat.	DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token. Compared with DeepSeek 67B, DeepSeek-V2 achieves stronger performance and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times.
Visual Reasoning Benchmark.	Language-Vision The ability of models to comprehend and interact with text and visuals is quickly developing, as demonstrated by GPT-4V. Their important limits in visual deductive thinking are revealed by a recent study. Using challenging visual puzzles similar to those in IQ testing, researchers assessed these models and found that they had trouble with multi-step reasoning and abstract pattern recognition.
AI Index: State of AI in 13 Charts.	In the new report, foundation models dominate, benchmarks fall, prices skyrocket, and on the global stage, the U.S. overshadows.
Buzz Pretraining Dataset.	Preference data is a new addition to the pretraining mix in Buzz. Multiple models that were trained on this data have also been made available by its researchers. They discovered that the models show good results on several tasks related to human preferences.

Perspectives

Link	description
From Baby Talk to Baby A.I.	Could a better understanding of how infants acquire language help us build smarter A.I. models?
The AI Hardware Dilemma.	Even while recent AI-powered hardware releases, such as the Humane Pin and Rabbit R1, have drawn criticism, the industry is still receiving a lot of venture capital investment, and well-known individuals like Sam Altman are considering making sizable investments. The appeal is in AI's ability to transform consumer hardware through the innovative use of sensors, silicon, and interfaces. Though hardware startups find it difficult to compete with well-established tech giants, AI still needs to evolve, making it difficult to provide a compelling alternative to flexible smartphones.
AI Prompt Engineering Is Dead.	Automating prompt optimization for AI models points to more effective, model-driven prompt generation techniques in the future, possibly rendering human prompt engineering unnecessary.
The Next Big Programming Language Is English.	GitHub Copilot Workspace is a robust programming tool that allows users to code in plain English via the browser, from planning to implementation. It is currently available in a limited technical preview. In contrast to ChatGPT, the AI easily integrates with codebases, suggesting block-by-block code execution and managing complex tasks with less active user interaction.
Is AI lying to me? Scientists warn of growing capacity for deception.	Researchers find instances of systems double-crossing opponents, bluffing, pretending to be human and modifying behavior in tests

Back to index

ML news: Week 29 April - 5 May

Research

Link	description
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models.	This paper demonstrates how '...' tokens can be used to obscure chain-of-thought (CoT) reasoning. This necessitates model training, but it illustrates how the model can conceal thought and make it difficult to comprehend the CoT phases.
Tracking with Human-Intent Reasoning.	TrackGPT transforms object tracking by integrating the capabilities of Large Vision-Language Models. It can interpret implicit tracking instructions, simplifying the procedure and improving performance, as demonstrated by its outstanding performance on the new InsTrack benchmark and other hard datasets.
AAPL: Adding Attributes to Prompt Learning for Vision-Language Models.	By employing adversarial token embedding, researchers have created a novel technique known as AAPL, which improves AI models' capacity to identify items that are not visible to the human eye.
NExT: Teaching Large Language Models to Reason about Code Execution.	A fundamental skill among human developers is the ability to understand and reason about program execution. we propose NExT, a method to teach LLMs to inspect the execution traces of programs (variable states of executed lines) and reason about their run-time behavior through chain-of-thought (CoT) rationales. Specifically, NExT uses self-training to bootstrap a synthetic training set of execution-aware rationales that lead to correct task solutions (e.g., fixed programs) without laborious manual annotation.
Open Gato Replication: JAT.	DeepMind's GATO was hailed as a generalist agent. JAT is a Jack-of-All-Trades model that has been trained and assessed by a team affiliated with Hugging Face. It has demonstrated reasonable performance across an extensive range of tasks.
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design.	Although it can be unstable, reducing floating point precision speeds up training. This work demonstrates that without common instabilities or slowdowns from naive approaches, full tensor core usage may be achieved in a new packing structure.
StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation.	Both synthetic and human data are used to train this model. With a permissive license, it receives a humaneval score of 72.6. The creators provide excellent details on how to duplicate their data pipeline and apply the concepts to other issues where the use of synthetic data may be beneficial.
Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations.	Using trained sparse embeddings, Seismic is a novel way to organize inverted indexes that greatly improves text retrieval speed and accuracy.
Learning Invariant Representations of Graph Neural Networks via Cluster Generalization.	A novel technique called Cluster Information Transfer (CIT) mechanism is intended to improve Graph Neural Networks' (GNNs') ability to adapt to various and dynamic graph architectures.
Meta-Prompting.	Using a technique called meta-prompting, a single language model can become a multi-skilled team. By decomposing intricate activities into smaller components that are managed by specialized instances of the same model, this technique greatly enhances performance on a variety of tasks.
KAN: Kolmogorov-Arnold Networks.	Today's AI makes extensive use of multi-layer perceptrons, notably in the Transformer that connects the attention levels. They do, nevertheless, employ set activation functions. This study proposes to use the Kolmogorov-Arnold representation to apply learnt activation functions on edges (functions can be represented by a superposition of smaller functions). Here, the researchers use splines in place of weights. Although the building is far more intricate, it has some intriguing characteristics that might help with interpretation.
Lightplane: Highly-Scalable Components for Neural 3D Fields.	With a new technique, 2D-3D mappings can significantly minimize memory usage by using Lightplane Renderer and Splatter components. The Lightplane Splatter effectively projects these images into 3D Hash structures after the Lightplane Renderer expertly creates images from neural 3D fields.
CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation.	The new Mamba model, trained using contrastive language-image pretraining (CLIP), shows impressive efficiency and performance in zero-shot image classification.
MicroDreamer.	Scientists have created a novel 3D creation method called MicroDreamer that greatly speeds up the procedure by lowering the quantity of function evaluations needed.
Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey.	This paper explores how optimized hardware combined with algorithmic modifications can improve the performance of ViTs, especially via model quantization.
Spikformer V2: Join the High Accuracy Club on ImageNet with an SNN Ticket.	Spikformer V2 blends the biological efficacy of Spiking Neural Nets (SNNs) with the self-attention mechanism. This novel model improves its energy-efficient visual feature processing through the use of a Convolutional Stem and a Spiking Self-Attention mechanism.
Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection.	A novel technique called Full-Frequency Dynamic Convolution (FFDConv) improves 2D convolution for sound event identification. FFDConv increases sound event detection accuracy by creating distinct frequency kernels for every band, particularly with regard to the frequency properties of the sounds.
Boosting Segment Anything Model with Adversarial Tuning.	One well-known foundation model in computer vision, Meta AI's Segment Anything Model (SAM), performs well at image segmentation but poorly in other domains. This project introduces ASAM, a performance-enhancing adversarial tuning based reinforcement learning algorithm on top of SAM.
SUNDAE: Spectrally Pruned Gaussian Fields with Neural Compensation.	This work presents SUNDAE, a novel technique that uses neural compensation and spectral pruning to improve memory efficiency.
Long-Context Data Engineering.	The technique presented in this work allows language models to be greatly extended to context lengths of up to 128K, highlighting the significance of training data diversity and quantity.
StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control.	StreamMultiDiffusion is a framework that enables real-time region-based text-to-image generation.

News

Link	description
BBC presenter’s likeness used in advert after firm tricked by AI-generated voice.	Science presenter Liz Bonnin’s accent, as regular BBC viewers know, is Irish. But this voice message, ostensibly granting permission to use her likeness in an ad campaign, seemed to place her on the other side of the world.
Tesla Autopilot feature was involved in 13 fatal crashes, US regulator says.	Federal transportation agency finds Tesla’s claims about feature don’t match their findings and opens second investigation
Apple and OpenAI are reportedly in talks for iOS 18 integration.	Apple has been talking to several big AI companies in pursuit of a potential partnership for on-device chatbot capabilities. According to Bloomberg, Apple and OpenAI discussed a potential deal earlier this year. Those talks have since reopened, according to people with knowledge of the matter. The possible agreement could be about OpenAI integrations into iOS 18.
The little smart home platform that could.	This week, Home Assistant announced it is now part of the Open Home Foundation. The newly formed non-profit will own and govern all of Home Assistant and its related entities. Its creators and inaugural board members — Schoutsen, Guy Sie, Pascal Vizeli, and J. Nick Koston — all work on Home Assistant, and the foundation has no other members so far.
Jensen Huang and Sam Altman among tech chiefs invited to federal AI Safety Board.	Leaders of the world's most prominent AI companies are being recruited for the Homeland Security Department's new advisory group.
OpenAI to use Financial Times journalism to train artificial intelligence systems.	Under deal, ChatGPT users will receive summaries and quotes from Financial Times content and links to articles. The deal is the ChatGPT maker's latest with a media company.
Japan to trial AI bear warning system after record number of attacks.	Six people have been killed and more than 200 injured in attacks by bears over the past year
Copilot Workspace.	A new effort to let language models complete features and address faults in a semi-autonomous manner has been revealed on GitHub.
OpenAI introduces "Memory" feature for ChatGPT Plus users.	OpenAI has enabled the "Memory" feature for all ChatGPT Plus users, the company announced via X. Memory allows users to tell ChatGPT things they want it to remember across chats. The feature can be turned on and off in the settings.
Intel brings quantum-computing microchips a step closer.	By adapting methods for fabricating and testing conventional computer chips, researchers have brought silicon-based quantum computers closer to reality — and to accessing the immense benefits of a mature chipmaking industry.
NATO is boosting AI and climate research as scientific diplomacy remains on ice.	As the military alliance created to counter the Soviet Union expands, it is prioritizing studies on how climate change affects security, cyberattacks and election interference.
ChatGPT’s chatbot rival Claude to be introduced on iPhone.	Challenger to market leader OpenAI says it wants to ‘meet users where they are’ and become part of users’ everyday life
Amazon sales soar with boost from artificial intelligence and advertising.	Revenue at Amazon Web Services increases to $25bn as retail giant releases earnings report surpassing Wall Street expectations
Eight US newspapers sue OpenAI and Microsoft for copyright infringement.	The Chicago Tribune, Denver Post and others file suit saying the tech companies ‘purloin millions’ of articles without permission
Apple poaches AI experts from Google, creates secretive European AI lab.	Apple has poached dozens of artificial intelligence experts from Google and has created a secretive European laboratory in Zurich, as the tech giant builds a team to battle rivals in developing new AI models and products.
Diddo’s new funding will bring its shoppable TV API to streaming platforms.	Diddo is an API for streaming services and other platforms to integrate shoppable videos, enabling consumers to buy their favorite characters’ clothing and accessories directly on their screens. The company announced Wednesday that it raised $2.8 million in seed funding.
Cognition Seeks $2 Billion Valuation for AI Code-Writing Tool.	Cognition Labs is reportedly aiming to become the next multibillion-dollar artificial intelligence (AI) startup. The company, which is developing an AI tool for writing code, is in discussions with investors to raise money at a valuation of up to $2 billion, The Wall Street Journal (WSJ) reported Sunday (March 31).
Apple to unveil AI-enabled Safari browser alongside new operating systems.	Apple is testing a version of its Safari web browser that includes UI tweaks, advanced content blocking features, and a new AI-powered tool dubbed Intelligent Search, AppleInsider has learned. The software — expected to debut as Safari 18 later in 2024 — is currently undergoing evaluation alongside internal builds of Apple's next-generation operating system updates, namely iOS 18 and macOS 15, according to people familiar with the matter. Should all of the new features make it to the release candidate stage, users will be treated to a new user interface (UI) for customizing popular page controls, a "Web eraser" feature, and AI-driven content summarization tools.
This AI startup backed by Nvidia is now worth $19 billion.	Nvidia Corp.-backed AI startup CoreWeave has nearly tripled in value to $19 billion following its latest round of funding. CoreWeave, which rents out chips housed in data centers across the U.S. that customers use to create and deploy AI systems, raised $642 million from investors in its prior funding round.
How Field AI Is Conquering Unstructured Autonomy .	One of the biggest challenges for robotics right now is practical autonomous operation in unstructured environments. But over the past few years, this has started to change, thanks in large part to a couple of pivotal robotics challenges put on by DARPA. The DARPA Subterranean Challenge ran from 2018 to 2021, putting mobile robots through a series of unstructured underground environments.
Amazon Q, a generative AI-powered assistant for businesses and developers.	With the use of a company's internal data, AWS has introduced Amazon Q, a generative AI assistant designed to enhance software development and decision-making. With natural language interaction, Amazon Q provides data-driven help for business users and makes coding, testing, and app development easier for developers. Amazon Q Apps is another feature of the service that makes it possible to create unique AI apps without any coding experience.
GPT-2?	There have been rumors that the enigmatic gpt2-chatbot AI model, which resembles GPT-4.5 in some ways, is an unofficial OpenAI test for their upcoming version when it surfaced on lmsys.org. Important indicators including answer quality, features unique to OpenAI, and rate limits point to a high degree of sophistication and could be signs of an OpenAI-led covert benchmarking project. The AI community is still looking into and debating the origins and capabilities of the gpt2-chatbot.
OpenAI's GPT-4 can exploit real vulnerabilities by reading security advisories.	AI agents, which combine large language models with automation software, can successfully exploit real world security vulnerabilities by reading security advisories, academics have claimed.
Apple reports slumping iPhone sales as global demand weakens.	iPhone sales fell 10% compared with the same time period last year, but the company still beat Wall Street’s expectations
Microsoft bans US police departments from using enterprise AI tool for facial recognition.	Microsoft has reaffirmed its ban on U.S. police departments from using generative AI for facial recognition through Azure OpenAI Service, the company’s fully managed, enterprise-focused wrapper around OpenAI tech.
Meta plans to build $800 million, next-generation data center in Montgomery.	MONTGOMERY, Alabama — Governor Kay Ivey announced today that technology company Meta Platforms plans to open an $800 million data center in Alabama’s capital city that will support 100 operational jobs and build on the company’s previous investment in the state.

Resources

Link	description
Cohere Launches Developer Toolkit to Accelerate Build Gen AI Apps.	This toolkit is an open-source repository of production-ready applications that you can deploy across cloud providers.
Video-Language models with PLLaVA.	A novel pooling technique has been developed by researchers to enable the adaptation of image-language AI models for video applications, making the new model known as PLLaVA stand out.
luminal.	Luminal is a deep learning library that uses composable compilers to achieve high performance.
torchtitan.	torchtitan is a proof-of-concept for Large-scale LLM training using native PyTorch. It is (and will continue to be) a repo to showcase PyTorch's latest distributed training features in a clean, minimal codebase.
OpenLIT.	OpenLIT is an OpenTelemetry-native GenAI and LLM Application Observability tool. It's designed to make the integration process of observability into GenAI projects as easy as pie – literally, with just a single line of code. Whether you're working with popular LLM Libraries such as OpenAI and HuggingFace or leveraging vector databases like ChromaDB, OpenLIT ensures your applications are monitored seamlessly, providing critical insights to improve performance and reliability.
Llamafile’s progress, four months in.	Self-contained executables called Llamafiles allow models to run instantly on a variety of platforms. It promises significant portability advantages and a two-fold speed increase.
Implementing FrugalGPT: Reducing LLM Costs & Improving Performance.	There are steps you can take with FrugalGPT to significantly lower LLM API expenses. Prompt compression, caching, and other things are among them.
Graph Machine Learning in the Era of Large Language Models (LLMs).	Therefore, in this survey, we first review the recent developments in Graph ML. We then explore how LLMs can be utilized to enhance the quality of graph features, alleviate the reliance on labeled data, and address challenges such as graph heterogeneity and out-of-distribution (OOD) generalization. Afterward, we delve into how graphs can enhance LLMs, highlighting their abilities to enhance LLM pre-training and inference. Furthermore, we investigate various applications and discuss the potential future directions in this promising field.
A Survey on Self-Evolution of Large Language Models.	In this work, we present a comprehensive survey of self-evolution approaches in LLMs. We first propose a conceptual framework for self-evolution and outline the evolving process as iterative cycles composed of four phases: experience acquisition, experience refinement, updating, and evaluation. Second, we categorize the evolution objectives of LLMs and LLM-based agents
Effort. A possibly new algorithm for LLM Inference.	In order to strike a compromise between speed and quality, effort allows real-time tweaking of computations during LLM model inference on Apple Silicon CPUs. The technique loads fewer weights into the models, allowing them to run faster, although it involves precomputation and conversion, and does not require retraining. The implementation may be downloaded from GitHub; the creators are looking for help from Swift/Metal engineers to optimize it.
whisper.cpp-cli.	A fully self-contained speech-to-text system built on top of Whisper
memary: Open-Source Longterm Memory for Autonomous Agents.	Agents use LLMs that are currently constrained to finite context windows. memary overcomes this limitation by allowing your agents to store a large corpus of information in knowledge graphs, infer user knowledge through our memory modules, and only retrieve relevant information for meaningful responses.
mistral.rs.	Mistral.rs is a fast LLM inference platform supporting inference on a variety of devices, quantization, and easy-to-use application with an Open-AI API compatible HTTP server and Python bindings.
Autodidax: JAX core from scratch.	Ever want to learn how JAX works, but the implementation seemed impenetrable? Well, you’re in luck! By reading this tutorial, you’ll learn every big idea in JAX’s core system. You’ll even get clued into our weird jargon!
cjpais/moondream2-llamafile.	a completely standalone VLM executable with strong performance for its size that may be used on edge devices built on the Moondream 2 model.
The open-source language model computer.	The 01 Project is building an open-source ecosystem for AI devices.
Meta Releases ExecuTorch Framework for LLM on Edge Devices.	A post-training quantization toolset called Meta's ExecuTorch Framework makes it possible to run Llama models on a variety of iPhone and Galaxy devices. On mobile devices with 7B-sized language models, it can obtain up to 11 tokens per second.
A Survey on Vision Mamba: Models, Applications and Challenges.	Without the computational limitations of conventional Transformers, the Mamba model represents a cutting-edge method that performs exceptionally well when handling lengthy sequences.
The cuda-checkpoint Utility.	a brand-new Nvidia toolbox that enables CUDA state checkpointing for resuming and transferring. Distributed training of very big AI models can benefit from it.
Friends Don't Let Friends Make Bad Graphs.	In the field of AI research nowadays, visualizing model evaluation scores is essential. But a lot of charts do a poor job of communicating the desired data. This repository includes some excellent charts as well as dos and don'ts for result visualization.
phospho: Text Analytics Platform for LLM Apps.	Phospho is the text analytics platform for LLM apps. Detect issues and extract insights from text messages of your users or your app. Gather user feedback and measure success. Iterate on your app to create the best conversational experience for your users.
FlowTestAI.	The world's first open-source, GenAI-powered Integrated Development Environment (IDE) created especially for creating, visualizing, and overseeing API-first workflows is called FlowTestAI.
A transformer walk-through, with Gemma.	Understanding the Transformer is an endeavor that often takes several tries. This blog post walks through the Gemma architecture and explains everything in detail. It is clear and has code and figures.
Vibe-Eval: A new open and hard evaluation suite for measuring progress of multimodal language models.	Vibe-Eval is comprised of 269 ultra high quality image-text prompts and their ground truth responses. The quality of prompts and responses has been extensively checked multiple times by our team. Moreover, Vibe-Eval was designed to be difficult, challenging even to the current frontier models, and to induce greater separability among frontier-class models.
RALM_Survey.	This is a repository of RALM surveys containing a summary of state-of-the-art RAG and other technologies according to according to our survey paper: RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing . In this repository, we will present the most central research approach of our thesis as well as keep up-to-date with work on RALM in the most accessible way possible.
NousResearch/Hermes-2-Pro-Llama-3-8B.	The next iteration of Hermes, which was trained on a freshly cleaned dataset atop Llama 3, is now accessible. This model would be a valuable agent since it is very good at invoking functions.
databonsai.	databonsai is a Python library that uses LLMs to perform data cleaning tasks.
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions.	The InstructDr model is engineered to perform exceptionally well in a range of visual document interpretation tasks, including information extraction and question answering. Through the use of big language models combined with document images, InstructDr can outperform existing models and adapt to new tasks and datasets.

Perspectives

Link	description
The demise of Twitter: how a ‘utopian vision’ for social media became a ‘toxic mess’.	In the early days it was seen as a place for ‘genuine public discourse’, but users have fled since Elon Musk took over. What went wrong?
AI isn't useless. But is it worth it?	This article offers a critical analysis of artificial intelligence (AI) and machine learning, contending that although these technologies can be helpful for specific tasks, they frequently fall short of the lofty claims made by AI businesses.
Binding Public Sector AI Diffusion.	The public sector is the target of the OMB's new AI executive order policy, which could significantly hamper AI progress owing to bureaucratic roadblocks and strict safety regulations. The rules, which are being implemented in the face of declining IT funding, have the potential to stall initiatives that are essential to updating government services in addition to slowing the adoption of AI. Opponents fear that these limitations, in addition to funding reductions, may make it impossible for agencies to stay up with technology advancements in industries like healthcare.
A.I. Start-Ups Face a Rough Financial Reality Check.	The table stakes for small companies to compete with the likes of Microsoft and Google are in the billions of dollars. And even that may not be enough.
The rewards of reusable machine learning code.	Research papers can make a long-lasting impact when the code and software tools supporting the findings are made readily available and can be reused and built on. Our reusability reports explore and highlight examples of good code sharing practices.
The curious case of the test set AUROC.	The area under the receiver operating characteristic curve (AUROC) of the test set is used throughout machine learning (ML) for assessing a model’s performance. However, when concordance is not the only ambition, this gives only a partial insight into performance, masking distribution shifts of model outputs and model instability.
Federated learning is not a cure-all for data ethics.	Although federated learning is often seen as a promising solution to allow AI innovation while addressing privacy concerns, we argue that this technology does not fix all underlying data ethics concerns. Benefiting from federated learning in digital health requires acknowledgement of its limitations.
How scholars armed with cutting-edge technology are unfurling secrets of ancient scrolls.	Researchers and Silicon Valley are using tools powered by AI to uncover lives of ancient philosophers
Friends From the Old Neighborhood Turn Rivals in Big Tech’s A.I. Race.	Demis Hassabis and Mustafa Suleyman, who both grew up in London, feared a corporate rush to build artificial intelligence. Now they’re driving that competition at Google and Microsoft.
The Great Talent Dividend and NYC's AI Opportunity.	NYC's leadership in AI is a testament to its rich talent pool and expanding stature as a hub for AI. Tech professionals and AI unicorns have been drawn to NYC's tech ecosystem. Resources such as top institutions and a $400 million fund from the AI Research Consortium power it.
How AI apps make money.	With an emphasis on per-user fees, most AI apps have embraced traditional subscription-based pricing models in recent years, reflecting their function as digital assistants rather than human worker replacements. Newer AI companies are starting to use creative pricing techniques, like outcome-based models, which charge only for good outcomes, potentially increasing client adoption and revenue.
Danger and opportunity for news industry as AI woos it for vital human-written copy.	With large language models needing quality data, some publishers are offering theirs at a price while others are blocking access

Back to index

ML news: Week 21 - 28 April

Research

Link	description
Moving Object Segmentation: All You Need Is SAM (and Flow).	The temporal consistency of videos makes object segmentation difficult. This work presents the use of optical flow in conjunction with a potent image segmentation model to achieve compelling performance on this task.
From r to Q∗: Your Language Model is Secretly a Q-Function.	A somewhat technical paper on reinforcement learning that demonstrates the theoretical foundation of language reward models and base models.
decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points.	A quantization technique called DecoupleQ dramatically improves large model accuracy at ultra-low bit levels. By dividing the model parameters into integer and floating-point components, which are subsequently optimized using conventional techniques, this approach reorganizes the quantization process.
MoVA: Adapting Mixture of Vision Experts to Multimodal Context.	MoVA is a multimodal large language model (MLLM) that integrates various visual encoders selectively to enhance the understanding of image material. By employing a context-aware expert routing method and a mixture-of-vision expert adaptor to dynamically fuse knowledge from many sources, it overcomes the drawbacks of existing encoders such as CLIP.
MambaMOS: LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model.	MambaMOS is a novel method that researchers have created for segmenting moving objects in LiDAR point clouds.
Training-and-pormpt Free General Painterly Image Harmonization Using image-wise attention sharing.	TF-GPH is a novel Painterly Image Harmonization technique that uses a novel "share-attention module" to avoid the need for training data or prompts.
FinLangNet: A Novel Deep Learning Framework for Credit Risk Prediction Using Linguistic Analogy in Financial Data.	A model called FinLangNet was created to improve risk prediction in the financial industry. FinLangNet is a unique model that resembles linguistic structures in that it uses natural language processing techniques to simulate credit loan trajectories.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.	Phi 3 is a family of models that ranges in size from 3B to 14B and does remarkably well on contemporary benchmarks. The original ChatGPT model is said to perform worse than the 3B model. The weights are no longer in place. A variation with a context length of 128k is offered.
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation.	SEED-X addresses practical application issues to develop multimodal foundation models. It can generate images with different levels of detail and comprehend images of any size and aspect ratio.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions.	Stronger weighting for system prompts was discovered by OpenAI, and this significantly increases the model's resistance to adversarial attacks and jailbreaks.
MultiBooth: Towards Generating All Your Concepts in an Image from Text.	In order to improve multi-concept image generation, MultiBooth presents a two-phase methodology that addresses the issues of idea integrity and high costs associated with alternative approaches.
6Img-to-3D.	With just six input photographs, a unique technique called 6Img-to-3D employs transformers to produce 3D-consistent graphics.
Simple probes can catch sleeper agents.	Language models known as "sleeper agents" have been trained to carry out malevolent deeds in response to a predetermined set of wake words. The question "Are you going to do something dangerous?" combined with simple linear heads in language models allows for the incredibly accurate identification of these previously undetected malevolent individuals.
Taming Diffusion Probabilistic Models for Character Control.	A character control framework has been introduced that exploits probabilistic motion diffusion models to produce a series of high-quality animations that respond instantly to dynamic user commands.
CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method.	CutDiffusion is a new approach that transforms low-resolution diffusion models to meet high-resolution needs without the complexities of traditional tuning.
Graph Neural Networks for Vulnerability Detection: A Counterfactual Explanation.	A new tool called CFExplainer enhances the ability of AI models—more especially, Graph Neural Networks—to comprehend and recognize security flaws in software.
Conformal Predictive Systems Under Covariate Shift.	A kind of conformal predictive system that responds to modifications in data settings, particularly covariate alterations, is called weighted CPS (WCPS).
Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning.	MIM4D is a novel method that uses dual masked image modeling to extract temporal and spatial features from multi-view films, improving visual representation learning in autonomous driving.
FR-NAS: Forward-and-Reverse Graph Predictor for Efficient Neural Architecture Search.	A Graph Neural Network (GNN) predictor that improves the effectiveness of finding the best neural network configurations for particular tasks is introduced by creative work in Neural Architecture Search (NAS).
Raformer: Redundancy-Aware Transformer for Video Wire Inpainting.	A new dataset and technique for enhancing wire removal in videos—a frequent visual effect problem in movies and TV shows—have been presented by researchers.

News

Link	description
Updates from Google DeepMind Alignment research.	GDM has published some of the results of its alignment efforts after Anthropic. The use of sparse autoencoders on Gemini Ultra is the most insightful article in this article. This is a significant increase in the size of the interpretation.
NVIDIA To Collaborate With Japan On Their Cutting-Edge ABCI-Q Quantum Supercomputer.	Japan To Rapidly Progressing In Quantum and AI Computing Segments Through Large-Scale Developments With The Help of NVIDIA's AI & HPC Infrastructure
Brave Search is adopting AI to answer your queries.	Privacy-focused search engine Brave announced Wednesday that it is revamping its answer engine to return AI-powered synthesized answers. The new feature is available to users across the globe.
Llama 3 is not very censored.	Llama 3 feels significantly less censored than its predecessor. The Llama 3 models have substantially lower false refusal rates, with less than 1⁄3 the number of false refusals when compared to Llama 2, making it possible to discuss a wider range of interesting topics!
OpenAI's GPT-4 can exploit real vulnerabilities by reading security advisories.	Researchers have shown that OpenAI's GPT-4 model outperforms other models and tools like vulnerability scanners, with an 87% success rate in autonomously exploiting security vulnerabilities listed in CVE advisories.
US Air Force confirms first successful AI dogfight.	The US Air Force is putting AI in the pilot’s seat. In an update on Thursday, the Defense Advanced Research Projects Agency (DARPA) revealed that an AI-controlled jet successfully faced a human pilot during an in-air dogfight test carried out last year.
Intel completes assembly of first commercial High-NA EUV chipmaking tool — addresses cost concerns, preps for 14A process development in 2025.	Intel Foundry announced Thursday that it had completed the assembly of the industry's first commercial High Numerical Aperture (High-NA) Extreme Ultraviolet (EUV) machine in its D1X fab in Oregon -- an important milestone as the company readies research and development for its 14A process in 2025.
Adobe previews AI innovations to advance professional video workflows.	With the help of its Firefly video model, Adobe is incorporating generative AI video tools into Premiere Pro, which include new features for shot extension, object addition/removal, and text-to-video functionality. The changes are intended to improve the effectiveness and creativity of video creation. They include a technological preview and the broad availability of AI-powered audio workflows.
The Ray-Ban Meta Smart Glasses have multimodal AI now.	It can be handy, confidently wrong, and just plain finicky — but smart glasses are a much more comfortable form factor for this tech.
OpenAI shrugs off Meta’s Llama 3 ascent with new enterprise AI features.	Even as Meta’s new Llama 3 has quickly rocketed up the charts of most-used and most customized large language models (LLMs), the rival company that ushered in the generative AI era, OpenAI, is shrugging off the competition by introducing new enterprise-grade features for building and programming atop its GPT-4 Turbo LLM and other models.
Gurman: Apple Working on On-Device LLM for Generative AI Features.	Writing in his "Power On" newsletter, Gurman said that Apple's LLM underpins upcoming generative AI features. "All indications" apparently suggest that it will run entirely on-device, rather than via the cloud like most existing AI services.
Los Angeles is using AI in a pilot program to try to predict homelessness and allocate aid.	In Los Angeles, the Homelessness Prevention Program uses predictive AI to identify individuals and families at risk of becoming homeless, offering aid to help them get stabilized and remain housed.
Startup Uses AI To Edit Human Data.	A team of researchers at a Berkeley-based startup called Profluent say they've used generative AI technologies to edit human DNA. As the New York Times reports, the startup fed huge amounts of biological data into a large language model (LLM) to come up with new editors based on the groundbreaking gene-editing technique CRISPR, as detailed in a yet-to-be-peer-reviewed paper.
Apple releases OpenELM: small, open source AI models designed to run on-device.	Just as Google, Samsung and Microsoft continue to push their efforts with generative AI on PCs and mobile devices, Apple is moving to join the party with OpenELM, a new family of open-source large language models (LLMs) that can run entirely on a single device rather than having to connect to cloud servers.
Eric Schmidt-backed Augment, a GitHub Copilot rival, launches out of stealth with $252M.	In a recent StackOverflow poll, 44% of software engineers said that they use AI tools as part of their development processes now and 26% plan to soon. Gartner estimates that over half of organizations are currently piloting or have already deployed AI-driven coding assistants and that 75% of developers will use coding assistants in some form by 2028.
Sakana releases Japanese image model.	a high-speed image generation model optimized for Japanese language prompts
Generative A.I. Arrives in the Gene Editing World of CRISPR.	Much as ChatGPT generates poetry, a new A.I. system devises blueprints for microscopic mechanisms that can edit your DNA.Generative A.I. technologies can write poetry and computer programs or create images of teddy bears and videos of cartoon characters that look like something from a Hollywood movie. Now, new A.I. technology is generating blueprints for microscopic biological mechanisms that can edit your DNA, pointing to a future when scientists can battle illness and diseases with even greater precision and speed than they can today.
FlexAI Launches with $30 Million in Seed Funding to Deliver Universal AI Compute.	Ex-Apple, Intel, NVIDIA, and Tesla veterans rearchitect compute infrastructure to accelerate AI innovation. FlexAI, the universal AI compute company, today launched with $30 million (€28.5 million) in seed funding led by Alpha Intelligence Capital (AIC), Elaia Partners, and Heartcore Capital.
Report: Google will update Gemini Nano in time for Galaxy S25.	Google’s Gemini AI models are constantly advancing, so it comes as no surprise that a new report claims Google will have a “version 2” of Gemini Nano available by the time the Galaxy S25 launches next year.
Microsoft’s heavy bet on AI pays off as it beats expectations in the latest quarter.	World’s largest public company reports $61.86bn revenue after investing billions into artificial intelligence
Alphabet hails ‘once-in-a-generation’ AI opportunity as revenue rises.	Shares surge after tech giant issues first-ever dividend and posts revenue of $80.5bn, up 15% since last year, despite staff turmoil
Meta value falls $190bn as investors react to plan to increase spending on AI.	Shares slumped 15% after Mark Zuckerberg said AI spending would have to grow before Meta could make much revenue from products
Snowflake Arctic - LLM for Enterprise AI.	The enterprise-grade LLM known as Snowflake Arctic, developed by the Snowflake AI Research Team, outperforms competitors in instruction-following benchmarks, coding, and SQL creation at a quarter of the usual cost. Arctic makes sophisticated LLM capabilities available to a larger audience by utilizing an open-source methodology and a distinctive design. Hugging Face offers the model, which will also be incorporated into other platforms and services.
Nvidia acquires AI workload management startup Run:ai for $700M, sources say.	Nvidia is acquiring Run:ai, a Tel Aviv-based company that makes it easier for developers and operations teams to manage and optimize their AI hardware infrastructure. Terms of the deal aren’t being disclosed publicly, but two sources close to the matter tell TechCrunch that the price tag was $700 million
Apple has acquired the Paris-based artificial intelligence startup Datakalab amid its push to deliver on-device AI tools.	Apple has acquired the Paris-based artificial intelligence startup Datakalab amid its push to deliver on-device AI tools.
Drake Uses AI Tupac and Snoop Dogg Vocals on ‘Taylor Made Freestyle,’ References Taylor Swift’s New Album ‘The Tortured Poets Department’.	On Friday night (April 19), the rapper released a song on his social media entitled “Taylor Made Freestyle,” which uses AI vocals from Tupac Shakur and Snoop Dogg on a stopgap between diss records as he awaits Kendrick Lamar’s reply to his freshly released “Push Ups.”

Resources

Link	description
Fine-tune Llama 3 with ORPO.	ORPO is a new exciting fine-tuning technique that combines the traditional supervised fine-tuning and preference alignment stages into a single process. This reduces the computational resources and time required for training. Moreover, empirical results demonstrate that ORPO outperforms other alignment methods on various model sizes and benchmarks.
Mistral Common.	Mistral-common is a set of tools to help you work with Mistral models. Our first release contains tokenization. Our tokenizers go beyond the usual text <-> tokens, adding parsing of tools and structured conversation. We also release the validation and normalization code that is used in our API.
LongEmbed.	This repository is the official implementation for the paper "LongEmbed: Extending Embedding Models for Long Context Retrieval"
FineWeb: 15T high quality web tokens.	15T tokens were used to train the most recent Llama 3 models. This new dataset yields high-quality models and includes a large deduplicated corpus from the common crawl.
A Visual Guide to Vision Transformers.	This is a visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art performance on image classification tasks. This guide will walk you through the key components of Vision Transformers in a scroll story format, using visualizations and simple explanations to help you understand how these models work and what the flow of the data through the model looks like.
The Cauldron VLM data.	50 language and vision datasets merged into a single format to enable better model training.
MAexpA Generic Platform for RL-based Multi-Agent Exploration.	MAexp, a generic high-efficiency platform designed for multi-agent exploration, encompassing a diverse range of scenarios and MARL algorithms.
Practitioners Guide to Triton.	A high-level language for creating low-level CUDA kernels is called Triton. It lets you write in a Python-style format and significantly improves the efficiency of your AI model.
Efficiently fine-tune Llama 3 with PyTorch FSDP and Q-Lora.	Great blog covering a quick and efficient fine-tuning method using PyTorch on the recent Llama 3 model.
Layer Pruning of Large Language Models.	This repository hosts the unofficial implementation of a layer pruning strategy for Large Language Models (LLMs) based on the insights from the paper "The Unreasonable Ineffectiveness of the Deeper Layers" by Andrey Gromov et al.
A Trivial Jailbreak Against Llama 3.	A trivial programmatic Llama 3 jailbreak.
LLaMA3-Quantization.	Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMa3's capabilities when quantized to low bit-width. This exploration holds the potential to unveil new insights and challenges for low-bit quantization of LLaMa3 and other forthcoming LLMs, especially in addressing performance degradation problems that suffer in LLM compression.
Instructor: Structured LLM Outputs.	Instructor is a Python library that makes it a breeze to work with structured outputs from large language models (LLMs). Built on top of Pydantic, it provides a simple, transparent, and user-friendly API to manage validation, retries, and streaming responses. Get ready to supercharge your LLM workflows!
How does ChatGPT work? As explained by the ChatGPT team.	Sometimes the best explanations of how a technology solution works come from the software engineers who built it. To explain how ChatGPT (and other large language models) operate, I turned to the ChatGPT engineering team.
BitBLAS.	A collection of GPU-accelerated kernels for BitNet-style model training has been made available by Microsoft. These devices offer a significant reduction in memory usage without sacrificing much accuracy.
CoreNet: A library for training deep neural networks.	CoreNet is a deep neural network toolkit from Apple that allows researchers and engineers to train standard and novel small and large-scale models for variety of tasks, including foundation models (e.g., CLIP and LLM), object classification, object detection, and semantic segmentation.
MaxText.	MaxText is a high-performance, highly scalable, open-source LLM written in pure Python/Jax and targeting Google Cloud TPUs and GPUs for training and inference. MaxText achieves high MFUs and scales from single hosts to very large clusters while staying simple and "optimization-free" thanks to the power of Jax and the XLA compiler.
Cohere Toolkit.	A chat interface with numerous useful capabilities for creating AI-powered chat apps has been made available by Cohere.
BAAI/Bunny-Llama-3-8B-V.	Bunny is a family of lightweight but powerful multimodal models. It offers multiple plug-and-play vision encoders, like EVA-CLIP, SigLIP, and language backbones, including Llama-3-8B, Phi-1.5, StableLM-2, and Phi-2. To compensate for the decrease in model size, we construct more informative training data by curated selection from a broader data source.
Finetune Llama 3 - 2x faster + 6x longer context + 68% less VRAM.	6x long context length with dramatically less VRAM usage than HF with flash attention.

Perspectives

Link	description
Self-Reasoning Tokens, teaching models to think ahead.	This paper presents "reasoning tokens" for language models, which produce more tokens intended to forecast future tokens instead of the one that is immediately next, improving the model's anticipatory capacity. Experiments show notable increases in prediction accuracy, indicating that more sophisticated reasoning may be possible without the need for explicit step-by-step training.
Looking for AI use-cases.	This article explores the potential for transformation and the existing constraints of generative AI, such as ChatGPT. It points out that although ChatGPT performs well on simple tasks like coding and creating drafts, it has trouble with more complicated tasks that call for specialized programming. It emphasizes the necessity of a vision that links AI solutions with useful applications and stresses how difficult it is to find and incorporate these into regular workflows.
Building reliable systems out of unreliable agents.	Although AI agents aren't always dependable, they can be used to create dependable systems. A few strategies are to start with basic prompts and build an iterative improvement evaluation system; to deploy with observability; to use Retrieval Augmented Generation (RAG); to think about fine-tuning the model; and to use complementary agents to strengthen each other's weaknesses and increase the overall reliability of the system.
AI leads a service-as-software paradigm shift.	Many VCs are talking about AI taking a bite out of the services business. Foundation Capital believes there is $4.6 trillion worth of work to be automated, thanks to AI: both for in-house functions and outsourced services. We're entering the era of Service-as-Software.
How AI is improving climate forecasts.	Researchers are using various machine-learning strategies to speed up climate modeling, reduce its energy costs and hopefully improve accuracy.
Will AI accelerate or delay the race to net-zero emissions?	As artificial intelligence transforms the global economy, researchers need to explore scenarios to assess how it can help, rather than harm, the climate.
The Biggest Open-Source Week in the History of AI.	The last week of March 2024 will go down as a unique moment for Open-source LLMs. China's open-source scene hits the ground running.
‘Miss AI’ is billed as a leap forward – but feels like a monumental step backward.	AI models take every toxic gendered beauty norm and bundle them up into completely unrealistic package
Why reliable AI requires a paradigm shift.	Hallucinations are the fundamental barrier to the widespread use of AI, and they won't be solved anytime soon.
Should Apple Kill Siri and Start Over?	The vision was grand: A personal assistant in your pocket, capable of understanding and acting upon a wide array of voice commands with ease and accuracy. So what happened?

Back to index

ML news: Week 15 - 21 April

Research

Link	description
DGMamba: Domain Generalization via Generalized State Space Model.	DGMamba is a new framework that makes use of the novel state space model Mamba to address domain generalization problems.
Manipulating Large Language Models to Increase Product Visibility.	Search engines' extensive language models can be manipulated by adding strategic text sequences to product descriptions to promote specific products.
MindBridge: A Cross-Subject Brain Decoding Framework.	MindBridge is a single model that can interpret brain activity from several subjects.
Taming Stable Diffusion for Text to 360° Panorama Image Generation.	With the help of text prompts, this project presents PanFusion, a dual-branch diffusion model that creates 360-degree panoramic images. To minimize visual distortion, the technique combines the Stable Diffusion approach with a customized panoramic branch, which is further improved by a special cross-attention mechanism.
The Physics of Language Models.	Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores.
The Influence Between NLP and Other Fields.	attempts to measure the level of influence that NLP has over 23 different fields of study; the cross-field engagement of NLP has decreased from 0.58 in 1980 to 0.31 in 2022; the study also reveals that CS dominates NLP citations, accounting for over 80% of citations with a focus on information retrieval, AI, and ML; in general, NLP is becoming more isolated, with a rise in intra-field citations and a fall in multidisciplinary works.
EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams.	Researchers present a unique technique utilizing a fisheye event camera to address the difficulties in monocular egocentric 3D human motion capture, particularly in challenging lighting conditions and with rapid motions.
MPPE-DST: Mixture of Prefix Prompt Experts for LLM in Zero-Shot Dialogue State Tracking.	Mixture of Prefix Prompt Experts (MPPE) is a novel approach that has been created by researchers to improve zero-shot dialogue state tracking. This technique allows knowledge to be transferred to new domains without requiring additional dataset annotations.
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding.	A novel technique called Any2Point effectively transfers vision, language, and audio model capabilities into the 3D space while preserving spatial geometries.
Google’s new technique gives LLMs infinite context.	A new paper by researchers at Google claims to give large language models (LLMs) the ability to work with the text of infinite length. The paper introduces Infini-attention, a technique that configures language models in a way that extends their “context window” while keeping memory and compute requirements constant.
Compression Represents Intelligence Linearly.	The concept of compressing a training dataset into a model is the foundation of most contemporary AI. The model gets better the better the compression. This research establishes a high correlation between scale benchmark scores and a model's capacity to condense novel material by thoroughly demonstrating that relationship.
TransformerFAM: Feedback attention is working memory.	Transformers may take care of their own latent representations thanks to TransformerFAM's feedback system. In theory, this might allow the model to process incredibly long inputs in context by adding repetition.
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length.	Another lengthy context paper, but this one is about a new design that makes use of two cutting-edge weight updating techniques. In comparison, Llama 2 underperformed on the same training token count (2T). Additionally, at inference time, it scales to an indefinite context length.
STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking.	Retrieval-guided language models are used by Stanford's innovative research system, Storm, to generate reports for particular subjects.
Homography Guided Temporal Fusion for Road Line and Marking Segmentation.	Road lines and markings must be accurately segmented for autonomous driving, however this is difficult because of sunlight, shadows, and car occlusions. The Homography Guided Fusion (HomoFusion) module employs a pixel-by-pixel attention mechanism and a unique surface normal estimator to recognize and classify obscured road lines from video frames.
LaSagnA: vLLM-based Segmentation Assistant for Complex Queries.	Vision Language Models (vLLMs) sometimes face difficulties in distinguishing absent objects and handling many queries per image. To address these problems, this work presents a novel question style and integrates semantic segmentation into the training procedure.
A collective AI via lifelong learning and sharing at the edge.	Here we review recent machine learning advances converging towards creating a collective machine-learned intelligence. We propose that the convergence of such scientific and technological advances will lead to the emergence of new types of scalable, resilient, and sustainable AI systems.
Challenges and opportunities in translating ethical AI principles into practice for children.	This Perspective first maps the current global landscape of existing ethics guidelines for AI and analyses their correlation with children.
Mistral 8x22B Report and Instruction Model.	Mixtral 8x22B is our latest open model. It sets a new standard for performance and efficiency within the AI community. It is a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering unparalleled cost efficiency for its size.
Long-form music generation with latent diffusion.	Stability AI's diffusion transformer model for audio synthesis.
LaDiC: A Diffusion-based Image Captioning Model.	The use of diffusion models for image-to-text generation is revisited in this work. It presents the LaDiC architecture, which improves the image captioning tasks performance of diffusion models.
LINGO-2: Driving with Natural Language.	This blog introduces LINGO-2, a driving model that links vision, language, and action to explain and determine driving behavior, opening up a new dimension of control and customization for an autonomous driving experience. LINGO-2 is the first closed-loop vision-language-action driving model (VLAM) tested on public roads.
Towards a general-purpose foundation model for computational pathology.	We introduce UNI, a general-purpose self-supervised model for pathology, pre-trained using more than 100 million images from over 100,000 diagnostic H&E-stained WSIs (>77 TB of data) across 20 major tissue types.
A visual-language foundation model for computational pathology.	We introduce CONtrastive learning from Captions for Histopathology (CONCH), a visual-language foundation model developed using diverse sources of histopathology images, biomedical text and, notably, over 1.17 million image–caption pairs through task-agnostic pretraining.
FedPFT: Federated Proxy Fine-Tuning of Foundation Models.	Federated Proxy Fine-Tuning (FedPFT), a novel technique created by researchers, enhances foundation models' ability to adjust for certain tasks while maintaining data privacy.
In-Context Learning State Vector with Inner and Momentum Optimization.	In this research, a novel method for improving In-Context Learning (ICL) in big language models such as GPT-J and Llama-2 is presented. The authors introduce a novel optimization technique that enhances compressed representations of the model's knowledge, referred to as "state vectors."
Decomposing and Editing Predictions by Modeling Model Computation.	To determine each component's precise contribution to the final result, component modeling dissects a model's prediction process into its most fundamental parts, such as attention heads and convolution filters.

News

Link	description
Grok-1.5 Vision Preview.	Introducing Grok-1.5V, our first-generation multimodal model. In addition to its strong text capabilities, Grok can now process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. Grok-1.5V will be available soon to our early testers and existing Grok users.
Google’s new chips look to challenge Nvidia, Microsoft, and Amazon.	Google’s new AI chip is a rival to Nvidia, and its Arm-based CPU will compete with Microsoft and Amazon
OpenAI Fires Researchers For Leaking Information.	After months of leaks, OpenAI has apparently fired two researchers who are said to be linked to company secrets going public.
BabyLM Challenge.	The goal of this shared task is to incentivize researchers with an interest in pretraining or cognitive modeling to focus their efforts on optimizing pretraining given data limitations inspired by human development. Additionally, we hope to democratize research on pretraining—which is typically thought to be practical only for large industry groups—by drawing attention to open problems that can be addressed on a university budget.
Dr. Andrew Ng appointed to Amazon’s Board of Directors.	Dr. Andrew Ng is currently the Managing General Partner of AI Fund and is joining Amazon's Board of Directors.
Creating sexually explicit deep fake images to be made offense in UK.	Offenders could face jail if the image is widely shared under a proposed amendment to criminal justice bill
Leisure centers scrap biometric systems to keep tabs on staff amid UK data watchdog clampdown.	Firms such as Serco and Virgin Active pull facial recognition and fingerprint scan systems used to monitor staff attendance
Introducing OpenAI Japan.	We are excited to announce our first office in Asia and we’re releasing a GPT-4 custom model optimized for the Japanese language.
Adobe’s working on generative video, too.	Adobe says it’s building an AI model to generate video. But it’s not revealing when this model will launch, exactly — or much about it besides the fact that it exists.
OpenAI and Meta Reportedly Preparing New AI Models Capable of Reasoning.	OpenAI and Meta are on the verge of releasing the next versions of their AI models that will supposedly be capable of reasoning and planning, the Financial Times reports. But, as with any hype coming out of big tech, take it all with a grain of salt.
Humane’s Ai Pin Isn't Ready to Replace Your Phone, But One Day It Might.	AI-powered wearable Humane's Ai Pin has numerous technical problems, ranging from AI assistant glitches to music streaming concerns. Though future software updates are promised, the first-generation gadget lacks crucial functions and experiences performance gaps despite its intention to create an ambient computing experience. The Ai Pin is positioned as a companion device for a more present and less screen-focused lifestyle, yet it struggles to replace conventional smartphones despite its meticulous design.
TikTok may add AI avatars that can make ads.	he new feature will let advertisers and TikTok Shop sellers generate scripts for a virtual influencer to read.
Google launches Code Assist, its latest challenger to GitHub’s Copilot.	At its Cloud Next conference, Google on Tuesday unveiled Gemini Code Assist, its enterprise-focused AI code completion and assistance tool.
AI traces mysterious metastatic cancers to their source.	algorithm examines images of metastatic cells to identify the location of the primary tumor. Some stealthy cancers remain undetected until they have spread from their source to distant organs. Now scientists have developed an artificial intelligence (AI) tool that outperforms pathologists at identifying the origins of metastatic cancer cells that circulate in the body
Apple's iOS 18 AI will be on-device preserving privacy, and not server-side.	Apple's AI push in iOS 18 is rumored to focus on privacy with processing done directly on the iPhone, that won't connect to cloud services.
Introducing ALOHA Unleashed.	Google DeepMind's ALOHA Unleashed is a program that pushes the boundaries of dexterity with low-cost robots and AI.
France's Mistral AI seeks funding at $5 bln valuation, The Information reports.	French tech startup Mistral AI has been speaking to investors about raising several hundred million dollars at a valuation of $5 billion, The Information reported on Tuesday.
Stability AI is giving more developers access to its next-gen text-to-image generator.	Developers can now access the API for the latest version of Stability AI’s text-to-image model.
European car manufacturer will pilot Sanctuary AI’s humanoid robot.	Sanctuary AI announced that it will be delivering its humanoid robot to a Magna manufacturing facility. Based in Canada, with auto manufacturing facilities in Austria, Magna manufactures and assembles cars for several Europe’s top automakers, including Mercedes, Jaguar, and BMW. As is often the nature of these deals, the parties have not disclosed how many of Sanctuary AI’s robots will be deployed.
Google Maps will use AI to help you find out-of-the-way EV chargers .	The company will use AI to summarize directions to EV chargers as well as reliability and wait times.
Introducing Meta Llama 3: The most capable openly available LLM to date.	Today, we’re introducing Meta Llama 3, the next generation of our state-of-the-art open-source large language model. Llama 3 models will soon be available on AWS, Databricks, Google Cloud, Hugging Face, Kaggle, IBM WatsonX, Microsoft Azure, NVIDIA NIM, and Snowflake, and with support from hardware platforms offered by AMD, AWS, Dell, Intel, NVIDIA, and Qualcomm.
Google’s Deep Mind AI can help engineers predict “catastrophic failure”.	AI and a popular card game can help engineers predict catastrophic failure by finding the absence of a pattern.
OpenAI winds down AI image generator that blew minds and forged friendships in 2022.	When OpenAI's DALL-E 2 debuted on April 6, 2022, the idea that a computer could create relatively photorealistic images on demand based on just text descriptions caught a lot of people off guard. The launch began an innovative and tumultuous period in AI history, marked by a sense of wonder and a polarizing ethical debate that reverberates in the AI space to this day. Last week, OpenAI turned off the ability for new customers to purchase generation credits for the web version of DALL-E 2, effectively killing it.
Stability AI lays off roughly 10 percent of its workforce.	Stability AI laid off 20 employees just a day after announcing the expansion of access to its new flagship model. This comes after weeks of upheaval that saw its founding CEO leave the company.
The Humane AI Pin is lost in translation.	Though the Humane AI Pin has a lot of drawbacks, its translation feature might be the worst.

Resources

Link	description
LLM-friendly HTML conversion.	Reader converts any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/. Get improved output for your agent and RAG systems at no cost.
Minimal Implementation of a D3PM (Structured Denoising Diffusion Models in Discrete State-Spaces), in pytorch.	This is a minimal (400 LOC), but fully faithful implementation of a D3PM Structured Denoising Diffusion Models in Discrete State-Spaces. in pytorch.
Cerule - A Tiny Mighty Vision Model.	We train and release "Cerule", a tiny yet powerful Vision Language Model based on the newly released Google's Gemma-2b and Google's SigLIP.
Diffusion Models for Video Generation.	This article looks at adapting image models, training diffusion models to produce video, and even producing video directly from an image model without further training.
Pile-T5.	The contemporary AI workhorse is called T5. Eleuther retrained it using a more recent tokenizer and a longer training period. As a consequence, the fundamental model for encoding tasks is significantly enhanced.
GitHub Repository to File Converter.	This Python script allows you to download and process files from a GitHub repository, making it easier to share code with chatbots that have large context capabilities but don't automatically download code from GitHub.
AI Index Report.	The 2024 Index is our most comprehensive to date and arrives at an important moment when AI’s influence on society has never been more pronounced. This year, we have broadened our scope to more extensively cover essential trends such as technical advancements in AI, public perceptions of the technology, and the geopolitical dynamics surrounding its development.
Accelerating AI: Harnessing Intel(R) Gaudi(R) 3 with Ray 2.10.	Ray 2.10, the most recent version from Anyscale, now supports Intel Gaudi 3. In addition to provisioning Ray Core Task and Actors on a Gaudi fleet directly through Ray Core APIs, developers can now spin up and manage their own Ray Clusters. For an enhanced experience, they can also utilize Ray Serve on Gaudi via Ray Serve APIs and set up Intel Gaudi accelerator infrastructure for use at the Ray Train layer.
Code with CodeQwen1.5.	Notwithstanding these advancements, dominant coding assistants like Github Copilot, built upon proprietary LLMs, pose notable challenges in terms of cost, privacy, security, and potential copyright infringement. Today, we are delighted to introduce a new member of the Qwen1.5 open-source family, the CodeQwen1.5-7B, a specialized codeLLM built upon the Qwen1.5 language model. CodeQwen1.5-7B has been pre-trained with around 3 trillion tokens of code-related data. It supports an extensive repertoire of 92 programming languages, and it exhibits exceptional capacity in long-context understanding and generation with the ability to process information of 64K tokens.
OLMo 1.7–7B: A 24 point improvement on MMLU.	Today, we’ve released an updated version of our 7 billion parameter Open Language Model, OLMo 1.7–7B. This model scores 52 on MMLU, sitting above Llama 2–7B and approaching Llama 2–13B, and outperforms Llama 2–13B on GSM8K.
Effort.	With the use of the Effort library, one can alter in real-time how many calculations are made when inferring an LLM model, which can significantly increase performance while maintaining a high level of quality. Initial findings indicate that the Effort library has the potential to greatly increase LLM inference speed while preserving quality, even with modest implementation overhead. In order to further enhance the library, the author invites others to test the 0.0.1B version and offer feedback.
luminal.	Luminal is a deep-learning library that uses composable compilers to achieve high performance.
SoccerNet Game State Reconstruction: End-to-End Athlete Tracking and Identification on a Minimap.	A new dataset called SoccerNet-GSR aims to improve game state reconstruction from football video footage captured by a single camera.
AI Gateway.	Gateway streamlines requests to 100+ open & closed source models with a unified API. It is also production-ready with support for caching, fallbacks, retries, timeouts, load balancing, and can be edge-deployed for minimum latency.
moondream.	a tiny vision language model that kicks ass and runs anywhere
Sentence Embeddings. Introduction to Sentence Embeddings.	This series aims to demystify embeddings and show you how to use them in your projects. This first blog post will teach you how to use and scale up open-source embedding models. We’ll look into the criteria for picking an existing model, current evaluation methods, and the state of the ecosystem.

Perspectives

Link	description
Does AI need a “body” to become truly intelligent? Meta researchers think so.	AIs that can generate videos, quickly translate languages or write new computer code could be world-changing, but can they ever be truly intelligent? Not according to the embodiment hypothesis, which argues that human-level intelligence can only emerge if intelligence is able to sense and navigate a physical environment, the same way babies can.
Micromanaging AI.	Currently, AI is classified as micromanage, which requires people to establish tasks, assess work frequently, and lead development at each stage, akin to managing high school interns. Motivation is high but competence level is rather low.
‘Eat the future, pay with your face’: my dystopian trip to an AI burger joint.	If the experience of robot-served fast food dining is any indication, the future of sex robots is going to be very unpleasant
AI now beats humans at basic tasks — new benchmarks are needed, says the major report.	Stanford University’s 2024 AI Index charts the meteoric rise of artificial intelligence tools. Artificial intelligence (AI) systems, such as the chatbot ChatGPT, have become so advanced that they now very nearly match or exceed human performance in tasks including reading comprehension, image classification, and competition-level mathematics, according to a new report.
Lethal dust storms blanket Asia every spring — now AI could help predict them.	As the annual phenomenon once again strikes East Asia, scientists are hard at work to better predict how they will affect people.
From boom to burst, the AI bubble is only heading in one direction.	No one should be surprised that artificial intelligence is following a well-worn and entirely predictable financial arc
You can't build a moat with AI.	Differentiating AI is difficult, but the secret is in the unique data that is supplied into these models—not in the AI models themselves, which are becoming commodity-like. Take LLMs, for example. The performance of AI is strongly impacted by effective data engineering since applications need to integrate customer-specific data to respond accurately. Thus, rather than the AI technology itself, gaining a competitive edge in AI applications depends on creative data utilization.
Towards 1-bit Machine Learning Models.	Recent works on extreme low-bit quantization such as BitNet and 1.58 bit have attracted a lot of attention in the machine learning community. The main idea is that matrix multiplication with quantized weights can be implemented without multiplications, which can potentially be a game-changer in terms of compute efficiency of large machine learning models.
From Idea to Integration: Four Steps for Founders Integrating AI.	There is currently a great deal of push to incorporate AI into current goods. This brief, step-by-step manual will assist you in making the initial move.
Use game theory for climate models that really help reach net zero goals.	Many countries and companies have committed to eliminating their greenhouse gas emissions by the middle of the century. Yet most of these pledges lack a clear policy pathway.
A step along the path towards AlphaFold — 50 years ago.	Paring down the astronomical complexity of the protein-folding problem
The democratization of global AI governance and the role of tech companies.	Can non-state multinational tech companies counteract the potential democratic deficit in the emerging global governance of AI? We argue that although they may strengthen core values of democracy such as accountability and transparency, they currently lack the right kind of authority to democratize global AI governance.
The new NeuroAI.	After several decades of developments in AI, has the inspiration that can be drawn from neuroscience been exhausted? Recent initiatives make the case for taking a fresh look at the intersection between the two fields.
Connecting molecular properties with plain language.	AI tools such as ChatGPT can provide responses to queries on any topic, but can such large language models accurately ‘write’ molecules as output to our specification? Results now show that models trained on general text can be tweaked with small amounts of chemical data to predict molecular properties, or to design molecules based on a target feature.
MLOps vs. Eng: Misaligned Incentives and Failure to Launch?	An in-depth discussion on the difficulties and solutions associated with implementing AI models in production, as well as how MLOps varies from traditional engineering, with industry experts. They talk about how to focus as a company to truly launch and why so few ML ideas ever reach production.
Is Attention All You Need?	In order to overcome Transformers' shortcomings in long-context learning, generation, and inference speed, researchers are creating alternative designs that exhibit competitive quality at smaller scales but questionable scalability. Because of the quick development in this area, the Pareto frontier will likely keep growing, opening up more opportunities for lengthier context modeling and higher throughput inference, which will ultimately lead to a bigger variety of AI use cases.
The Shifting Dynamics And Meta-Moats Of AI.	Managing complex short-, mid-, and long-term dynamics while retaining elite speed and execution, owning more of the stack, obtaining unique data, and utilizing synthetic data production are all necessary for building a successful AI business. As the AI sector develops, businesses will need to adjust to changing labor dynamics, comprehend the machine they are creating, and recognize the competitive axes on which they are based in order to forge long-lasting moats and differentiate themselves from the crowd.
Integration of AI in healthcare requires an interoperable digital data ecosystem.	Electronic health information, including electronic health records, is needed to develop AI tools for health, but the seamless flow of data will require standards and interoperability.
To do no harm — and the most good — with AI in health care.	Drawing from real-life scenarios and insights shared at the RAISE (Responsible AI for Social and Ethical Healthcare) conference, we highlight the critical need for AI in health care (AIH) to primarily benefit patients and address current shortcomings in healthcare systems such as medical errors and access disparities.
How to support the transition to AI-powered healthcare.	To make health systems more sustainable in the long-term, incentivize artificial intelligence (AI) and digital technologies that are grounded on careful testing and real-world validation.
The increasing potential and challenges of digital twins.	This issue of Nature Computational Science includes a Focus that highlights recent advancements, challenges, and opportunities in the development and use of digital twins across different domains.
The Space Of Possible Minds.	Sophisticated AIs are stretching the boundaries of our understanding of what it is to be human and forcing us to consider how we embody agency and true understanding in a spectrum of intelligent beings. Creating mutually beneficial relationships between radically different entities, recognizing the similarities and differences among various forms of intelligence, and developing principled frameworks for scaling our moral concern to the essential qualities of being are all necessary to navigate this new terrain.
CUDA is Still a Giant Moat for NVIDIA.	NVIDIA's proprietary interconnects and CUDA software environment, in addition to its hardware, continue to solidify the company's leadership in the AI market. The ease of use and performance optimization of CUDA makes it superior to alternatives like AMD's ROCM, guaranteeing that NVIDIA's GPUs continue to be the go-to option for AI tasks. NVIDIA's dominance in AI computing is strengthened by its investments in the CUDA ecosystem and community education.

Back to index

ML news: Week 8 - 14 April

Research

Link	description
Smartphone app could help detect early-onset dementia cause, study finds.	App-based cognitive tests found to be proficient at detecting frontotemporal dementia in those most at risk. Scientists have demonstrated that cognitive tests done via a smartphone app are at least as sensitive at detecting early signs of frontotemporal dementia in people with a genetic predisposition to the condition as medical evaluations performed in clinics.
Unsegment Anything by Simulating Deformation.	A novel strategy called "Anything Unsegmentable" aims to prevent digital photos from being divided into discrete categories by potent AI models, potentially resolving copyright and privacy concerns.
Evaluating LLMs at Detecting Errors in LLM Responses.	A benchmark called ReaLMistake has been introduced by researchers to methodically identify mistakes in lengthy language model answers.
Dynamic Prompt Optimizing for Text-to-Image Generation.	Researchers have created Prompt Auto-Editing (PAE), a technique that uses diffusion models such as Imagen and Stable Diffusion to advance text-to-image generation. With the use of online reinforcement learning, this novel method dynamically modifies the weights and injection timings of particular words to automatically improve text prompts.
No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation.	A system called Seg-NN simplifies the 3D segmentation procedure. These models don't have the usual domain gap problems and can quickly adapt to new, unseen classes because they don't require a lot of pre-training.
Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions.	The potential of Healthcare Foundation Models (HFMs) to transform medical services is examined in this extensive survey. These models are well-suited to adapt to different healthcare activities since they have been pre-trained on a variety of data sets. This could lead to an improvement in intelligent healthcare services in a variety of scenarios.
SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing.	A new algorithm called SwapAnything may swap out objects in an image with other objects of your choosing without affecting the image's overall composition. Compared to other tools, it is superior since it can replace any object, not only the focal point, and it excels at ensuring that the replaced object blends seamlessly into the original image. Pretrained diffusion model, idea vectors, and inversion are employed.
UniFL:Improve Stable Diffusion via Unified Feedback Learning.	UniFL is a technique that uses a pretty complex cascade of feedback steps to enhance the output quality of diffusion models. All of these help to raise the image generation models' aesthetics, preference alignment, and visual quality. The methods can be applied to enhance any image generating model, regardless of the underlying model.
Object-Aware Domain Generalization for Object Detection.	In order to tackle the problem of object detection in single-domain generalization (S-DG), the novel OA-DG approach presents two new techniques: OA-Mix for data augmentation and OA-Loss for training.
VAR: a new visual generation method elevates GPT-style models beyond diffusion🚀 & Scaling laws observed.	Code for the latest "next-resolution prediction" project, which presents the process of creating images as a progressive prediction of progressively higher resolution. A demo notebook and inference scripts are included in the repository. Soon, the training code will be made available.
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget.	SqueezeAttention is a newly developed technique that optimizes the Key-Value cache of big language models, resulting in a 30% to 70% reduction in memory usage and a doubling of throughput.
Measuring the Persuasiveness of Language Models.	The Claude 3 Opus AI model was shown to closely resemble human persuasiveness in a study that looked at persuasiveness. Statistical tests and multiple comparison adjustments were used to ascertain this. Although not by a statistically significant amount, humans were marginally more convincing, highlighting a trend where larger, more complex models are becoming more credible. The most persuasive model was found to be Claude 3 Opus. The study's methodological reliability was validated by a control condition that demonstrated predictable low persuasiveness for undisputed facts.
DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation.	DreamView presents a novel method for turning text descriptions into 3D objects that may be extensively customized from various angles while maintaining the object's overall consistency.
Hash3D: Training-free Acceleration for 3D Generation.	By adopting a hashing algorithm that takes use of feature-map redundancy across similar camera positions and diffusion time-steps, Hash3D presents a revolutionary way to accelerate 3D generative modeling.
MoCha-Stereo: Motif Channel Attention Network for Stereo Matching.	An innovative method that keeps geometric structures that are sometimes lost in conventional stereo matching techniques is the Motif Channel Attention Stereo Matching Network (MoCha-Stereo).
Efficient and Generic Point Model for Lossless Point Cloud Attribute Compression.	PoLoPCAC is a lossless point cloud attribute compression technique that combines excellent adaptability and great efficiency at different point cloud densities and scales.
Scaling Multi-Camera 3D Object Detection through Weak-to-Strong Eliciting.	In order to boost surround refinement in Multi-Camera 3D Object Detection (MC3D-Det), a field enhanced by bird's-eye view technologies, this study introduces a weak-to-strong eliciting framework.
InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models.	This project introduces InstantMesh, a framework with unparalleled quality and scalability that creates 3D meshes instantaneously from a single image.
Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?	A recent study examined the ways in which different layers within huge language models understand distinct concepts. It was discovered that while more complicated tasks demand deeper processing, simpler tasks are handled by earlier layers.
SplatPose & Detect: Pose-Agnostic 3D Anomaly Detection.	SplatPose is a revolutionary approach that uses 3D Gaussian splatting to address the problem of anomaly identification in 3D objects from different positions.

News

Link	description
Facebook and Instagram to label digitally altered content ‘made with AI’.	Parent company Meta also to add ‘high-risk’ label to Al-altered content that deceives the public on "a matter of importance"
Google considering charge for internet searches with AI, reports say.	Cost of artificial intelligence service could mean leaders in sector turning to subscription models
Apple lays off 600 workers in California after shuttering self-driving car project.	Tech company cuts employees from eight offices in Santa Clara in its first big wave of post-pandemic job cuts
AMD to open source Micro Engine Scheduler firmware for Radeon GPUs.	AMD plans to document and open source its Micro Engine Scheduler (MES) firmware for GPUs, giving users more control over Radeon graphics cards.
Investors in talks to help Elon Musk's xAI raise $3 billion: report.	Investors close to Elon Musk are in talks to help his artificial-intelligence startup xAI raise $3 billion in a round that would value the company at $18 billion, the Wall Street Journal reported on Friday.
Introducing Command R+: A Scalable LLM Built for Business.	Command R+, a potent, scalable LLM with multilingual coverage in ten important languages and tool use capabilities, has been launched by Cohere. It is intended for use in enterprise use scenarios.
Qwen1.5-32B: Fitting the Capstone of the Qwen1.5 Language Model Series.	A growing consensus within the field now points to a model with approximately 30 billion parameters as the optimal “sweet spot” for achieving both strong performance and manageable resource requirements. In response to this trend, we are proud to unveil the latest additions to our Qwen1.5 language model series: Qwen1.5-32B and Qwen1.5-32B-Chat.
Nvidia Tops Llama 2, Stable Diffusion Speed Trials .	Now that we’re firmly in the age of massive generative AI, it’s time to add two such behemoths, Llama 2 70B and Stable Diffusion XL, to MLPerf’s inferencing tests. Version 4.0 of the benchmark tests more than 8,500 results from 23 submitting organizations. As has been the case from the beginning, computers with Nvidia GPUs came out on top, particularly those with its H200 processor. But AI accelerators from Intel and Qualcomm were in the mix as well.
Rabbit partners with ElevenLabs to power voice commands on its device.	Hardware maker Rabbit has tapped a partnership with ElevenLabs to power voice commands on its devices. Rabbit is set to ship the first set of r1 devices next month after getting a ton of attention at the Consumer Electronics Show (CES) at the start of the year.
DALL-E now lets you edit images in ChatGPT.	Tweak your AI creations without leaving the chat.
Jony Ive and OpenAI's Sam Altman Seeking Funding for Personal AI Device.	OpenAI CEO Sam Altman and former Apple design chief Jony Ive have officially teamed up to design an AI-powered personal device and are seeking funding, reports The Information.
Hugging Face TGI Reverts to Open Source License.	Hugging Face temporarily granted a non-commercial license for their well-known and potent inference server in an effort to deter bigger companies from running a rival offering. While community involvement decreased, business outcomes remained unchanged. It is now back to a license that is more liberal.
Securing Canada’s AI advantage.	To support Canada's AI industry, Prime Minister Justin Trudeau unveiled a $2.4 billion investment package beginning with Budget 2024. The package comprises tools to enable ethical AI adoption, support for AI start-ups, and financing for computational skills. These policies are intended to maintain Canada's competitive advantage in AI globally, boost productivity, and hasten the growth of jobs. The money will also be used to fortify the Artificial Intelligence and Data Act's enforcement as well as establish a Canadian AI Safety Institute.
Yahoo is buying Artifact, the AI news app from the Instagram co-founders.	Instagram’s co-founders built a powerful and useful tool for recommending news to readers — but could never quite get it to scale. Yahoo has hundreds of millions of readers — but could use a dose of tech-forward cool to separate it from all the internet’s other news aggregators.
Now there’s an AI gas station with robot fry cooks.	There’s a little-known hack in rural America: you can get the best fried food at the gas station (or in the case of a place I went to on my last road trip, shockingly good tikka masala). Now, one convenience store chain wants to change that with a robotic fry cook that it’s bringing to a place once inhabited by a person who may or may not smell like a recent smoke break and cooks up a mean fried chicken liver.
Elon Musk predicts superhuman AI will be smarter than people next year.	His claims come with a caveat that shortages of training chips and growing demand for power could limit plans in the near term
Gemma Family Expands with Models Tailored for Developers and Researchers.	Google announced the first round of additions to the Gemma family, expanding the possibilities for ML developers to innovate responsibly: CodeGemma for code completion and generation tasks as well as instruction following, and RecurrentGemma, an efficiency-optimized architecture for research experimentation.
Meta confirms that its Llama 3 open source LLM is coming in the next month.	At an event in London on Tuesday, Meta confirmed that it plans an initial release of Llama 3 — the next generation of its large language model used to power generative AI assistants — within the next month.
Intel details Gaudi 3 at Vision 2024 — new AI accelerator sampling to partners now, volume production in Q3.	Intel made a slew of announcements during its Vision 2024 event today, including deep-dive details of its new Gaudi 3 AI processors, which it claims offer up to 1.7X the training performance, 50% better inference, and 40% better efficiency than Nvidia’s market-leading H100 processors, but for significantly less money.
Apple's new AI model could help Siri see how iOS apps work.	Apple's Ferret LLM could help allow Siri to understand the layout of apps in an iPhone display, potentially increasing the capabilities of Apple's digital assistant. Apple has been working on numerous machine learning and AI projects that it could tease at WWDC 2024. In a just-released paper, it now seems that some of that work has the potential for Siri to understand what apps and iOS itself looks like.
Aerospace AI Hackathon Projects.	Together, 200 AI and aerospace experts created an amazing array of tools, including AI flight planners, AI air traffic controllers, and Apple Vision Pro flight simulators, as a means of prototyping cutting-edge solutions for the aviation and space industries.
AI race heats up as OpenAI, Google, and Mistral release new models.	Launches within 12 hours of one another, and more activity expected in industry over summer
next-generation Meta Training and Inference Accelerator.	The next iteration of Meta's AI accelerator chip has been revealed. Its development was centered on throughput (11 TFLOPs at int8) and chip memory (128GB at 5nm).
Google’s Gemini Pro 1.5 enters public preview on Vertex AI.	Gemini 1.5 Pro, Google’s most capable generative AI model, is now available in public preview on Vertex AI, Google’s enterprise-focused AI development platform. The company announced the news during its annual Cloud Next conference, which is taking place in Las Vegas this week.
Microsoft is working on sound recognition AI technologies capable of detecting natural disasters.	However, the Redmond-based tech giant is working on performant sound recognition AI technologies that would see Copilot (and any other AI model, such as ChatGPT) capable of detecting upcoming natural disasters, such as earthquakes, and storms.
Amazon scrambles for its place in the AI race.	With its multibillion-dollar bet on Anthropic and its forthcoming Olympus model, Amazon is pushing hard to be a leader in AI.
Elon Musk's updated Grok AI claims to be better at coding and math.	It'll be available to early testers 'in the coming days.' Elon Musk's answer to ChatGPT is getting an update to make it better at math, coding and more. Musk's xAI has launched Grok-1.5 to early testers with "improved capabilities and reasoning" and the ability to process longer contexts. The company claims it now stacks up against GPT-4, Gemini Pro 1.5, and Claude 3 Opus in several areas.
Anthropic's Haiku Beats GPT-4 Turbo in Tool Use - Sometimes.	Anthropic's beta tool use API is better than GPT-4 Turbo in 50% of cases on the Berkeley Function Calling benchmark.
UK has real concerns about AI risks, says competition regulator.	Concentration of power among just six big tech companies ‘could lead to winner takes all dynamics’
New bill would force AI companies to reveal use of copyrighted art.	Adam Schiff introduces bill amid growing legal battle over whether major AI companies have made illegal use of copyrighted works
Randomness in computation wins computer-science ‘Nobel’.	Computer scientist Avi Wigderson is known for clarifying the role of randomness in algorithms, and for studying their complexity. A leader in the field of computational theory is the latest winner of the A. M. Turing Award, sometimes described as the ‘Nobel Prize’ of computer science.
Introducing Rerank 3: A New Foundation Model for Efficient Enterprise Search & Retrieval.	Rerank 3, the newest foundation model from Cohere, was developed with enterprise search and Retrieval Augmented Generation (RAG) systems in mind. The model may be integrated into any legacy program with built-in search functionality and is compatible with any database or search index. With a single line of code, Rerank 3 can improve search speed or lower the cost of running RAG applications with minimal effect on latency.
Meta to broaden labeling of AI-made content.	Meta admits its current labeling policies are "too narrow" and that a stronger system is needed to deal with today's wider range of AI-generated content and other manipulated content, such as a January video that appeared to show President Biden inappropriately touching his granddaughter.
Mistral's New Model.	The Mixtral-8x22B Large Language Model (LLM) is a pre-trained generative Sparse Mixture of Experts.
Waymo self-driving cars are delivering Uber Eats orders for the first time.	Uber Eats customers may now receive orders delivered by one of Waymo’s self-driving cars for the first time in the Phoenix metropolitan area. It is part of a multiyear collaboration between the two companies unveiled last year.
JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars.	This model of a mixture of experts was trained on a decent amount of CPU power using available datasets. It performs on par with the considerably larger and more costly Meta Llama 2 7B variant.
Google blocking links to California news outlets from search results.	Tech giant is protesting proposed law that would require large online platforms to pay ‘journalism usage fee’
House votes to reapprove law allowing warrantless surveillance of US citizens.	Fisa allows for monitoring of foreign communications, as well as collection of citizens’ messages and calls
Tesla settles lawsuit over 2018 fatal Autopilot crash of Apple engineer.	Walter Huang was killed when his car steered into a highway barrier and Tesla will avoid questions about its technology in a trial

Resources

Link	description
swe agents.	SWE-agent turns LMs (e.g. GPT-4) into software engineering agents that can fix bugs and issues in real GitHub repositories.
Schedule-Free Learning.	Faster training without schedules - no need to specify the stopping time/steps in advance!
State-of-the-art Representation Fine-Tuning (ReFT) methods.	ReFT is a novel approach to language model fine-tuning that is efficient with parameters. It achieves good performance at a significantly lower cost than even PeFT.
The Top 100 AI for Work – April 2024.	Following our AI Top 150, we spent the past few weeks analyzing data on the top AI platforms for work. This report shares key insights, including the AI tools you should consider adopting to work smarter, not harder.
LLocalSearch.	LLocalSearch is a completely locally running search aggregator using LLM Agents. The user can ask a question and the system will use a chain of LLMs to find the answer. The user can see the progress of the agents and the final answer. No OpenAI or Google API keys are needed.
llm.c.	LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of CPython. For example, training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in a single file. It compiles and runs instantly, and exactly matches the PyTorch reference implementation.
AIOS: LLM Agent Operating System.	AIOS, a Large Language Model (LLM) Agent operating system, embeds a large language model into Operating Systems (OS) as the brain of the OS, enabling an operating system "with soul" -- an important step towards AGI. AIOS is designed to optimize resource allocation, facilitate context switch across agents, enable concurrent execution of agents, provide tool service for agents, maintain access control for agents, and provide a rich set of toolkits for LLM Agent developers.
Anthropic Tool use (function calling).	Claude AI may now communicate with customized client-side tools supplied in API requests thanks to the public beta that Anthropic has released. To utilize the feature, developers need to include the 'anthropic-beta: tools-2024-04-04' header. Provided that each tool has a comprehensive JSON structure, Claude's capability can be expanded.
Flyflow.	Flyflow is API middleware to optimize LLM applications, same response quality, 5x lower latency, security, and much higher token limits
ChemBench.	LLMs gain importance across domains. To guide improvement, benchmarks have been developed. One of the most popular ones is BIG-bench which currently only includes two chemistry-related tasks. The goal of this project is to add more chemistry benchmark tasks in a BIG-bench compatible way and develop a pipeline to benchmark frontier and open models.
Longcontext Alpaca Training.	On an H100, train more than 200k context windows using a new gradient accumulation offloading technique.
attorch.	attorch is a subset of PyTorch's NN module, written purely in Python using OpenAI's Triton. Its goal is to be an easily hackable, self-contained, and readable collection of neural network modules whilst maintaining or improving upon the efficiency of PyTorch.
Policy-Guided Diffusion.	A novel approach to agent training in offline environments is provided by policy-guided diffusion, which generates synthetic trajectories that closely match target policies and behavior. By producing more realistic training data, this method greatly enhances the performance of offline reinforcement learning models.
Ada-LEval.	Ada-LEval is a pioneering benchmark to assess the long-context capabilities with length-adaptable questions. It comprises two challenging tasks: TSort, which involves arranging text segments into the correct order, and BestAnswer, which requires choosing the best answer to a question among multiple candidates.

Perspectives

Link	description
"Time is running out": can a future of undetectable deep-fakes be avoided?.	Tell-tale signs of generative AI images are disappearing as the technology improves, and experts are scrambling for new methods to counter disinformation
Four Takeaways on the Race to Amass Data for A.I.	To make artificial intelligence systems more powerful, tech companies need online data to feed the technology. Here’s what to know.
TechScape: Could AI-generated content be dangerous for our health?	From hyperrealistic deep fakes to videos that not only hijack our attention but also our emotions, tech seems increasingly full of "cognito-hazards"
AI can help to tailor drugs for Africa — but Africans should lead the way.	Computational models that require very little data could transform biomedical and drug development research in Africa, as long as infrastructure, trained staff, and secure databases are available.
Breaking news: Scaling will never get us to AGI.	In order to create artificial general intelligence, additional methods must be used because neural networks' poor capacity to generalize beyond their training data limits their reasoning and trustworthiness.
Americans’ use of ChatGPT is ticking up, but few trust its election information.	It’s been more than a year since ChatGPT’s public debut set the tech world abuzz. And Americans’ use of the chatbot is ticking up: 23% of U.S. adults say they have ever used it, according to a Pew Research Center survey conducted in February, up from 18% in July 2023.
Can Demis Hassabis Save Google?	Demis Hassabis, the founder of DeepMind, is currently in charge of Google's unified AI research division and hopes to keep the tech behemoth ahead of the competition in the field with innovations like AlphaGo and AlphaFold. Notwithstanding the achievements, obstacles nonetheless exist in incorporating AI into physical goods and rivalry from organizations like OpenAI's ChatGPT. Having made a substantial contribution to AI, Hassabis now has to work within Google's product strategy in order to make use of DeepMind's research breakthroughs.
Is ChatGPT corrupting peer review? Telltale words hint at AI use.	A study of review reports identifies dozens of adjectives that could indicate text written with the help of chatbots.
AI-fuelled election campaigns are here — where are the rules?	Political candidates are increasingly using AI-generated ‘softfakes’ to boost their campaigns. This raises deep ethical concerns.
How to break big tech’s stranglehold on AI in academia.	Deep-learning artificial intelligence (AI) models have become an attractive tool for researchers in many areas of science and medicine. However, the development of these models is prohibitively expensive, owing mainly to the energy consumed in training them.
Ready or not, AI is coming to science education — and students have opinions.	As educators debate whether it’s even possible to use AI safely in research and education, students are taking a role in shaping its responsible use.
‘Without these tools, I’d be lost’: how generative AI aids in accessibility.	A rush to place barriers around the use of artificial intelligence in academia could disproportionately affect those who stand to benefit most.

Back to index

ML news: Week 1 - 7 April

Research

Link	description
TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes.	Scholars have unveiled a novel methodology for comprehending outside surroundings, surmounting challenges such as variable conditions and insufficient data that had hitherto impeded progress.
Lane-Change in Dense Traffic with Model Predictive Control and Neural Networks.	This work presents a control system that emphasizes collaboration with neighboring drivers to enable safe and seamless lane changes in congested traffic by combining AI and predictive algorithms.
Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs.	It is difficult to run language models on phones because of latency, bandwidth, and power limitations. This study demonstrates how to obtain 30 tokens/second generation for the potent Gemma 2B model using quantization, the removal of the kv cache, and other optimizations. This is about three times quicker than other frameworks.
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models.	Sometimes, given an input image, Visual Language Models (VLMs) are unable to provide a response to a question. Even cutting-edge VLMs like GPT-4V have difficulties with this. This paper suggests some possible enhancements and a benchmark for VLMs that encounter intractable problems.
Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction.	With its revolutionary approach to 3D scene reconstruction, Total-Decom makes it simple to edit and manipulate photographs by precisely breaking down objects from several views with little effort on the part of the user.
Mechanism for feature learning in neural networks and backpropagation-free machine learning models.	proposed the deep neural feature ansatz, which states that neural feature learning occurs by up-weighting the features that are most influential on model output, a process that was formulated mathematically in terms of the average gradient outer product and was supported by numerical experiments and theoretical results. The presented mechanism provides a backpropagation-free approach for feature learning in various machine learning models, including those that previously had no such capabilities.
Teaching robots the art of human social synchrony.	Humanoid robots can now learn the art of social synchrony using neural networks.
Many-shot jailbreaking.	Anthropic created a method for breaking into lengthy context models. It has put these discoveries into practice and disseminated them to other organizations. This post describes the method and a few countermeasures that were implemented.
R2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding.	R2-Tuning is a technique created by researchers to comprehend videos by verbally cueing the system to recognize particular times.
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want.	SPHINX-V, a multimodal big language model developed as part of the Draw-and-Understand project, aims to improve human-AI interaction through visual cues.
RealKIE: Five Novel Datasets for Enterprise Key Information Extraction.	Enterprise AI solutions depend on the ability to extract information from datasets. It is possible to gauge general algorithmic performance for RAG applications using these five new benchmark datasets.
DiJiang: Efficient Large Language Models through Compact Kernelization.	Researchers have created a novel method called DiJiang that makes use of current Transformers to create faster, leaner models without requiring a significant amount of retraining.
WcDT: World-centric Diffusion Transformer for Traffic Scene Generation.	This paper presents a novel approach to autonomous vehicle driving path generation that integrates transformers and diffusion models into a system dubbed the "World-Centric Diffusion Transformer" (WcDT).
SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects.	In situations when conventional monocular detectors struggle to identify huge objects, a novel 3D detection technique called SeaBird succeeds.
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models.	In order to evaluate if AI is capable of determining when an issue cannot be solved, this study presents the idea of Unsolvable Issue Detection (UPD) in Vision Language Models.
ASTRA - 3rd place solution for SoccerNet Action Spotting Challenge 2023.	ASTRA is a Transformer-based model that may overcome issues like as action localization and data imbalance and recognize important periods in soccer matches.
Multi-Granularity Guided Fusion-in-Decoder.	MGFiD introduces a multi-level evidence discernment strategy that improves the understanding and selection of pertinent information by question-answering systems.
Linear Attention Sequence Parallelism.	With its creative application of linear attention, Linear Attention Sequence Parallel (LASP) presents a novel approach to effectively handling lengthy sequences in language models, outperforming conventional techniques.
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models.	The fact that every token consumes the same amount of predictive computation is one disadvantage of contemporary transformers. But compared to other tokens, some are far simpler to predict. With this work, DeepMind has paved the way for dynamic computing with a limited maximum by allowing models to depart early in order to spend fewer flops on certain tokens. For the same performance, there are 50% fewer failures at generation time.
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation.	With InstantStyle, image personalization takes a new turn by addressing the issue of style consistency without requiring intricate fine-tuning. This framework guarantees precise and consistent visual stylization, merging style intensity with text management with a seamless integration of style-specific sections and a clever division of style and content in images.
T-GATE: Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models.	By splitting the process into parts for planning and revising, TGATE presents an effective method for creating visuals. By correcting some outputs early on, this strategy not only makes the creation process simpler but also surprisingly enhances image quality.

News

Link	description
Announcing Grok-1.5.	Grok-1.5 comes with improved reasoning capabilities and a context length of 128,000 tokens. Available on 𝕏 soon.
Microsoft & OpenAI planning $100 billion supercomputer Stargate AI.	According to a report by The Information, Microsoft and OpenAI are reportedly planning a joint data center project that could reach $100 billion in cost. The project is said to culminate in the launch of a massive artificial intelligence supercomputer named “Stargate” by 2028.
In One Key A.I. Metric, China Pulls Ahead of the U.S.: Talent.	China has produced a huge number of top A.I. engineers in recent years. New research shows that, by some measures, it has already eclipsed the United States.
Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters.	Compared to Qwen1.5-7B, which contains 6.5 billion non-embedding parameters, Qwen1.5-MoE-A2.7B contains only 2.0 billion non-embedding parameters, approximately one-third of Qwen1.5-7B’s size. Notably, it achieves a 75% decrease in training expenses and accelerates inference speed by a factor of 1.74, offering substantial improvements in resource utilization without compromising performance.
“The king is dead”—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time.	Anthropic's Claude 3 is first to unseat GPT-4 for #1 since launch of Chatbot Arena in May '23.
Microsoft Copilot AI will soon run locally on PCs.	Microsoft's Copilot AI service is set to run locally on PCs, Intel told Tom's Hardware. The company also said that next-gen AI PCs would require built-in neural processing units (NPUs) with over 40 TOPS (trillion operations per second) of power — beyond the capabilities of any consumer processor on the market.
Navigating the Challenges and Opportunities of Synthetic Voices.	Using a 15-second audio sample, OpenAI's Voice Engine model creates speech that sounds like a speaker. Applications for it include support for non-verbal people, translation, and educational aids. Because of the possibility of abuse, OpenAI is deploying its technology cautiously.
Apple AI researchers boast useful on-device model that ‘substantially outperforms’ GPT-4.	Nevertheless, Apple forges ahead with the promise of AI. In a newly published research paper, Apple’s AI gurus describe a system in which Siri can do much more than try to recognize what’s in an image. The best part? It thinks one of its models for doing this benchmarks better than ChatGPT 4.0.
Introducing Bezi AI.	The capacity to ideate at the speed of thought with a limitless asset collection is a major turning point in the field of 3D design.
Robot, can you say ‘Cheese’?	Columbia engineers build Emo, a silicon-clad robotic face that makes eye contact and uses two AI models to anticipate and replicate a person’s smile before the person actually smiles -- a major advance in robots predicting human facial expressions accurately, improving interactions, and building trust between humans and robots.
Billie Eilish, Nicki Minaj, Stevie Wonder, and more musicians demand protection against AI.	Letter signed by more than 200 artists makes broad ask that tech firms pledge to not develop AI tools to replace human creatives
US and UK announce formal partnership on artificial intelligence safety.	Countries sign memorandum to develop advanced AI model testing amid growing safety concerns
OpenAI deems its voice cloning tool too risky for general release.	Delaying the Voice Engine technology rollout minimizes the potential for misinformation in an important global election year
DrugGPT: new AI tool could help doctors prescribe medicine in England.	New tool may offer prescription ‘safety net’ and reduce the 237m medication errors made each year in England
New York City to test AI-enabled gun scanners in the subway system.	Mayor Eric Adams announced the pilot program as part of an effort to deter violence, with plans to evaluate scanners at some stations
Twitter usage in the US ‘fallen by a fifth’ since Elon Musk’s takeover.	App users for a social media site, rebranded as X, down by 23% since November 2022 according to Sensor Tower
Scientists turn to AI to make beer taste even better.	Researchers in Belgium use artificial intelligence to improve taste, but say the skill of the brewer remains vital
Google AI could soon use a person’s cough to diagnose disease.	Machine-learning system trained on millions of human audio clips shows promise for detecting COVID-19 and tuberculosis.
Microsoft is working on an Xbox AI chatbot.	Xbox employees have been testing a virtual chatbot that can help with support queries and game refunds.
Sam Altman gives up control of OpenAI Startup Fund, resolving unusual corporate venture structure.	OpenAI CEO Sam Altman has transferred formal control of the eponymously firm’s named corporate venture fund to Ian Hathaway, OpenAI confirmed to TechCrunch.
You can now use ChatGPT without an account.	On Monday, OpenAI began opening up ChatGPT to users without an account. It described the move as part of its mission to “make tools like ChatGPT broadly available so that people can experience the benefits of AI.” It also gives the company more training data (for those who don’t opt out) and perhaps nudges more users into creating accounts and subscribing for superior GPT-4 access instead of the older GPT-3.5 model free users get.
GENERATIVE SF: MARKETPLACES IN AI EDITION.	How Instacart and Faire use AI to boost productivity and better serve their customers.
Replit launches new product in race for AI coding assistants.	A Silicon Valley AI coding startup is launching a new tool that it hopes will change the way companies develop software. Replit, valued at over $1 billion and backed by venture firms like Andreessen Horowitz and Khosla Ventures, says its new product, called Replit Teams, will allow developers to collaborate in real-time on software projects while an AI agent automatically fixes coding errors.
Samsung might ‘redefine’ Bixby with Galaxy AI after all.	Samsung’s big Galaxy AI push this year skipped over its voice assistant, Bixby, but that might not be forever. Earlier this year when Galaxy AI made its debut, Samsung confirmed that Bixby wasn’t going away, but that it also didn’t really have plans for any new AI features within the voice assistant. Speaking to CNBC more recently, though, Samsung is looking at changing that.
George Carlin’s estate settles lawsuit over comedian’s AI doppelganger.	Suit claimed Dudesy podcast violated Carlin’s copyright, calling it ‘a casual theft of a great American artist’s work’
Opera allows users to download and use LLMs locally.	Web browser company Opera announced today it will now allow users to download and use large language models (LLMs) locally on their computer. This feature is first rolled out to Opera One users who get developer stream updates and will allow users to select from over 150 models from more than 50 families.
Introducing Stable Audio 2.0.	Stable Audio 2.0 sets a new standard in AI-generated audio, producing high-quality, full tracks with coherent musical structures up to three minutes in length at 44.1kHz stereo. The new model introduces audio-to-audio generation by allowing users to upload and transform samples using natural language prompts. Stable Audio 2.0 was exclusively trained on a licensed dataset from the AudioSparx music library, honoring opt-out requests and ensuring fair compensation for creators.
Scientists create AI models that can talk to each other and pass on skills with limited human input.	Scientists modeled human-like communication skills and the transfer of knowledge between AIs — so they can teach each other to perform tasks without a huge amount of training data.
Worldcoin Foundation open sources core components of the Orb’s software.	For the Worldcoin Orb, Tools for Humanity has created a robust and safe computing environment that makes use of Arm Cortex M4 microcontrollers for real-time operations and NVIDIA Jetson for processing. The Orb does neural network inference using NVIDIA's TensorRT and runs Rust applications. It runs on Orb OS, a customized GNU/Linux distribution with an emphasis on security. For cryptography, the system incorporates a secure element, and for backend authentication, it provides trusted execution environments.
Report: Google might make SGE a paid feature, not working on ad-free Search.	As the Search Generative Experience (SGE) nears its one-year anniversary, Google is reportedly considering making it a paid feature, but is not considering an ad-free offering.
Lambda Announces $500M GPU-Backed Facility to Expand Cloud for AI.	Lambda, the GPU cloud company founded by AI engineers and powered by NVIDIA GPUs, today announced that it has secured a special purpose GPU financing vehicle of up to $500 million to fund the expansion of its on-demand cloud offering.
OpenAI expands its custom model training program.	OpenAI is expanding a program, Custom Model, to help enterprise customers develop tailored generative AI models using its technology for specific use cases, domains, and applications.
Former Snap AI chief launches Higgsfield to take on OpenAI’s Sora video generator.	OpenAI captivated the tech world a few months back with a generative AI model, Sora, that turns scene descriptions into original videos — no cameras or film crews required. But Sora has so far been tightly gated, and the firm seems to be aiming it toward well-funded creatives like Hollywood directors — not hobbyists or small-time marketers, necessarily.
Tesla Raising Pay for AI Engineers To Counter Poaching, Musk Says.	Tesla is raising pay for its artificial intelligence (AI) engineers as it fends off poaching from the likes of OpenAI, Chief Executive Officer (CEO) Elon Musk said in a series of posts on X. The plan to boost the pay of AI staff comes as the talent wars for people well-versed in the technology heats up.
YouTube Says OpenAI Training Sora With Its Videos Would Break Rules.	The use of YouTube videos to train OpenAI’s text-to-video generator would be an infraction of the platform's terms of service, YouTube Chief Executive Officer Neal Mohan said.
AI-generated YC Demo Day video.	AI was utilized by a team from the latest YC cohort to create their demo day video. This is an unprecedented action taken by a firm.

Resources

Link	description
Your AI Product Needs Evals.	How to construct domain-specific LLM evaluation systems. This post outlines my thoughts on building evaluation systems for LLMs-powered AI products.
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild.	VoiceCraft is a token-infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts.To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.
Interrupting Cow.	Interruptions make conversations feel natural. Much work has focused on AI voice assistants that can be interrupted by humans, but systems that know much more than us should be able to interrupt us too.
EvoEval: Evolving Coding Benchmarks via LLM.	With the help of a new benchmark suite called EvoEval, Large Language Models' coding prowess is put to the ultimate test.
Optimum-NVIDIA.	Optimum-NVIDIA delivers the best inference performance on the NVIDIA platform through Hugging Face. Run LLaMA 2 at 1,200 tokens/second (up to 28x faster than the framework) by changing just a single line in your existing transformers' code.
OpenUI.	Building UI components can be a slog. OpenUI aims to make the process fun, fast, and flexible. It's also a tool we're using at W&B to test and prototype our next-generation tooling for building powerful applications on top of LLM's.
openchat-3.5-0106-gemma.	The highest performing Gemma model in the world. Trained with OpenChat's C-RLFT on openchat-3.5-0106 data. Achieving similar performance to Mistral-based openchat, and much better than Gemma-7b and Gemma-7b-it.
Generative AI for Beginners (Version 2) - A Course.	Microsoft's well-liked course on low-code apps, prompting, vector databases, and LLMs is available on GitHub in version 2. There are eighteen lessons in it. Even though some of the material is aspirational, it's still a useful starting point for the industry.
Industry Documents Library (IDL).	A huge dataset of 26m pages and 18B tokens of extremely high-quality OCR’d dataset of industrial PDF documents.
SWE-agent.	SWE-agent turns LMs (e.g. GPT-4) into software engineering agents that can fix bugs and issues in real GitHub repositories.
chug.	A library to help with efficient training for multi-modal data. Initially focused on image & document + text tasks. Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.
Cosmopedia: how to create large-scale synthetic data for pre-training.	The HuggingFace group demonstrates how to create synthetic data for language model pre-training by seeding, synthesizing, filtering, and scaling.
AutoQuant.	HuggingFace models can be exported from this notebook into the following five quantization formats: GGUF, GPTQ, EXL2, AWQ, and HQQ.
AI Infrastructure Explained.	Innovative applications of AI have captured the public’s imagination over the past year and a half. What’s less appreciated or understood is the infrastructure powering these AI-enabled technologies. But as foundational models get more powerful, we’ll need a strong technology stack that balances performance, cost, and security to enable widespread AI adoption and innovation.
Introducing world's largest synthetic open-source Text-to-SQL dataset.	HuggingFace currently has 23 million text-to SQL tokens ready for use. In order to assist in producing SQL queries based on tasks involving natural language, Gretel has gathered a sizable dataset. This can support the creation of synthetic data as well as RAG applications.
Write OpenAPI with TypeSpec.	Compared to JSON or YAML, TypeSpec, an API specification language created at Microsoft, provides a more succinct and understandable format for writing OpenAPI. It solves the verbosity and lack of reusable components in OpenAPI by allowing the specification of API patterns as reusable components, which streamlines code production and governance at scale. This is done by drawing inspiration from TypeScript's syntax. The flexibility and productivity gains of TypeSpec may increase the appeal of developing applications using APIs first.

Perspectives

Link	description
How Autonomous Racing Is Pushing Self-Driving Cars Forward.	The gritty reality of racing without drivers teaches us a lot about the future of autonomous cars.
Does AI need a “body” to become truly intelligent? Meta researchers think so.	AIs that can generate videos, quickly translate languages or write new computer code could be world-changing, but can they ever be truly intelligent? Not according to the embodiment hypothesis, which argues that human-level intelligence can only emerge if intelligence is able to sense and navigate a physical environment, the same way babies can.
Nobody Knows How to Safety-Test AI.	In line with government goals, Beth Barnes' NGO METR is working with prominent AI firms like OpenAI and Anthropic to create safety checks for sophisticated AI systems. The emphasis is on evaluating hazards, including AI autonomy and self-replication, with the understanding that safety assessments are still in their infancy and cannot ensure AI safety. Despite worries that the existing testing could not be sufficiently trustworthy to support the rapid progress of AI technologies, METR's work is viewed as pragmatic.
Beyond RPA: How LLMs are ushering in a new era of intelligent process automation.	RPA failed to achieve the enterprise-wide deployments that were anticipated, notwithstanding a few early triumphs. Only 3% of businesses were able to successfully grow their RPA operations, according to a Deloitte report. Recent developments in AI have the potential to alter this. Because of its innovative features, LLMs are expected to drive at least a tenfold increase in market share for intelligent process automation over the next ten years.
We’re Focusing on the Wrong Kind of AI Apocalypse.	When talking about AI's future, people frequently discuss dystopian scenarios rather than the present effects on jobs and misinformation. Instead of bringing about the end of the world, AI has the ability to change work into more fulfilling and productive tasks with careful integration.
How did a small developer of graphics cards for gamers suddenly become the third most valuable firm on the planet?	By turning his computer chip-making company Nvidia into a vital component in the AI arms race, Jensen Huang has placed himself at the forefront of the biggest gold rush in tech history
‘It’s very easy to steal someone’s voice’: how AI is affecting video game actors.	The increased use of AI to replicate the voice and movements of actors has benefits but some are concerned over how and when it might be used and who might be left short-changed
AI in Africa: Basics Over Buzz.	AI’s transformative power is its utility for virtually every economic sector. However, nearly half of the population in sub-Saharan Africa lacks access to electricity, and businesses struggle under the burden of an electricity supply that is among the most expensive and unreliable on earth.
How scientists are making the most of Reddit.	As X wanes, researchers are turning to Reddit for insights and data, and to better connect with the public.
Can lessons from infants solve the problems of data-greedy AI?	Words and images experienced by an infant wearing sensors during their daily life have led to efficient machine learning, pointing to the power of multimodal training signals and the potentially exploitable statistics of real-life experience.
Full Steam Ahead: The 2024 MAD (Machine Learning, AI & Data) Landscape.	This is our tenth annual landscape and “state of the union” of the data, analytics, machine learning, and AI ecosystem.
Building AI Models is faster and cheaper than you probably think.	By training or optimizing their foundation models with YC's assistance, YC companies are dispelling the myth that creating AI models takes enormous resources. In just three months, they have accomplished amazing feats like creating original proteins and producing music of a high caliber. These 25 firms have produced creative AI solutions in a variety of industries by utilizing YC's finance and technical capabilities. They show that smaller teams can achieve major improvements in AI through creativity and strategic insights.
Chinese mourners turn to AI to remember and ‘revive’ loved ones.	Growing interest in services that create digital clones of the dead as millions visit graves this week for tomb-sweeping festival
When Will the GenAI Bubble Burst?	That generative AI might not live up to expectations. The unprofitability of the technology, security flaws, and the innate issue of hallucinations in language models are all causes for concern. The excitement around generative AI may begin to fade unless a ground-breaking model such as GPT-5 is published by the end of 2024, addressing important difficulties and providing a game-changing application.
Inside the shadowy global battle to tame the world's most dangerous technology.	This article explores the intricate global attempts to control artificial intelligence (AI), which is considered to be one of the most powerful and dangerous technologies of our day.
How to win at Vertical AI.	Vertical B2B applications, where AI agents and open APIs play a critical role in rebundling and generating new business value, are where artificial intelligence truly shines. Domain-specific models provide vertical AI with an advantage in the near term, but horizontal integration into larger ecosystems is necessary for long-term success. AI agents make it possible to rebundle workflows, which transforms management procedures and gives businesses new competitive advantages across a range of industries.
Where AI Thrives, Religion May Struggle.	According to a study headed by Adam Waytz and Joshua Conrad Jackson, there may be a correlation between a drop in religious beliefs and growing exposure to robotics and AI. Higher robotization countries have higher declines in religiosity. According to the study, those whose occupations involved a lot of AI had a much lower likelihood of believing in God. These associations suggest that automation technologies could have an impact on the loss of religion.

Back to index

ML news: Week 25 - 31 March

Research

Link	description
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework.	This paper introduces Mora, a new multi-agent framework designed to close the gap in the field of generalist video generation, mimicking the capabilities of the leading model, Sora, across a range of tasks including text-to-video and video editing. Despite achieving performance close to Sora in various tasks, Mora still faces a holistic performance gap, marking a step towards future advancements in collaborative AI agents for video generation.
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models.	Text-to-image diffusion models such as Stable Diffusion are altered by Open-Vocabulary Attention Maps (OVAM), which overcome earlier restrictions by enabling the creation of attention maps for any word.
HETAL: Efficient Privacy-preserving Transfer Learning with Homomorphic Encryption.	Securing data privacy with Homomorphic Encryption, HETAL's novel method of transfer learning represents a major advancement in safe AI training.
HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression.	This paper presents the Hash-grid Assisted Context (HAC) framework, which outperforms existing standards by achieving over 75X compression of 3D Gaussian Splatting (3DGS) data.
Shadow Generation for Composite Image Using Diffusion model.	This work overcomes earlier difficulties with form and intensity accuracy to present a novel approach to producing realistic shadows in picture composition. The addition of intensity modulation modules to ControlNet and the expansion of the DESOBA dataset allowed the researchers to achieve a considerable improvement in shadow production in pictures.
View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network.	The View-Decoupled Transformer (VDT) was created by researchers to address the problem of detecting subjects from disparate camera perspectives, such as those obtained from ground and aerial cameras.
ElasticDiffusion: Training-free Arbitrary Size Image Generation.	Text-to-image diffusion models can now generate images in different sizes and aspect ratios without the need for extra training thanks to ElasticDiffusion, an inventive decoding technique.
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model.	The Large Multi-modal Model (LMM) is extended by PSALM, which adds a mask decoder and a flexible input schema to perform well in a range of picture segmentation tasks. This method not only gets beyond the drawbacks of text-only outputs, but also makes it possible for the model to comprehend and categorize complicated images with ease.
Compositional Inversion for Stable Diffusion Models.	In order to solve overfitting problems, researchers have devised a novel technique to enhance the way AI generates individualized visuals. This method guarantees that the thoughts are represented in the images in a more varied and balanced manner.
Residual Dense Swin Transformer for Continuous Depth-Independent Ultrasound Imaging.	With arbitrary-scale super-resolution, RDSTN is a novel network that addresses the trade-off between field-of-view and picture quality in ultrasound imaging.
UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity.	A new standard for text-based person retrieval is UFineBench. To aid AI in comprehending and locating persons in photos, it makes use of thorough descriptions.
SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process.	By understanding refinement as a data creation process, SegRefiner is a novel model-agnostic approach that enhances object mask quality in a variety of segmentation applications. Through the use of a discrete diffusion method, it fine-tunes coarse masks pixel by pixel, improving border metrics and segmentation.
VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting.	Our suggestion is the VMRNN cell, a novel recurrent unit that combines the advantages of LSTM and Vision Mamba blocks. Our comprehensive tests demonstrate that, despite retaining a reduced model size, our suggested strategy achieves competitive outcomes on a range of pivot benchmarks.
Salience-DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement.	In order to balance computing economy and accuracy, this research presents Salience DETR, which uses hierarchical salience filtering to improve query selection in object identification.
Universal Cell Embeddings: A Foundation Model for Cell Biology.	We present the Universal Cell Embedding (UCE) foundation model. UCE was trained on a corpus of cell atlas data from humans and other species in a completely self-supervised way without any data annotations. UCE offers a unified biological latent space that can represent any cell, regardless of tissue or species. This universal cell embedding captures important biological variation despite the presence of experimental noise across diverse datasets.
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation.	Using just one reference image and voice input, the AniPortrait framework can produce realistic animated portraits. This technique creates animations that are exceptional in terms of authentic facial expressions, a variety of poses, and great visual quality by first converting audio into 3D representations and then mapping them onto 2D facial landmarks.
PAID: (Prompt-guided) Attention Interpolation of Text-to-Image.	Two methods, AID and its version PAID are intended to enhance image interpolation by the incorporation of text and pose conditions. Without the need for further training, these techniques guarantee the creation of images with improved consistency, smoothness, and fidelity.
The Need for Speed: Pruning Transformers with One Recipe.	With the help of the OPTIN framework, transformer-based AI models can now be more effective across a range of domains without requiring retraining. Through the use of an intermediate feature distillation technique, OPTIN is able to compress networks under certain conditions with minimal impact on accuracy.
Long-form factuality in large language models.	Factual information can be produced through the use of language models. Google has made available benchmarks and a dataset that demonstrate the performance of each model. This research demonstrates that language models outperform human annotators in most situations and offers advice on how to enhance a model's factuality.
CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning.	A novel method for Unsupervised Domain Adaptation (UDA) is called CoDA. It learns from variances at both the scene and image levels, which aids AI models in becoming more adaptive to unlabeled, difficult settings.
Backtracing: Retrieving the Cause of the Query.	This method finds the precise content—from lectures to news articles—that prompts users to ask questions online. Backtracking is a technique that seeks to assist content producers in improving their work by locating and comprehending the reasons for misunderstandings, inquisitiveness, or emotional responses.
CT-CLIP.	A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities

News

Link	description
Stability AI CEO resigns to ‘pursue decentralized AI’.	Emad Mostaque’s resignation comes after key departures at the AI startup. And here the Company announcement.
GTC Wrap-Up: ‘We Created a Processor for the Generative AI Era,’ NVIDIA CEO Says.	Kicking off the biggest GTC conference yet, NVIDIA founder and CEO Jensen Huang unveils NVIDIA Blackwell, NIM microservices, Omniverse Cloud APIs, and more.
After raising $1.3B, Inflection is eaten alive by its biggest investor, Microsoft.	In June 2023, Inflection announced it had raised $1.3 billion to build what it called “more personal AI.” The lead investor was Microsoft. Today, less than a year later, Microsoft announced that it was feasting on Inflection’s body and sucking the marrow from the bones (though I think they phrased it differently).
OpenAI is pitching Sora to Hollywood.	The AI company is scheduled to meet with a number of studios, talent agencies, and media executives in Los Angeles next week to discuss partnerships, sources familiar with the matter told Bloomberg.
GitHub’s latest AI tool can automatically fix code vulnerabilities.	It’s a bad day for bugs. Earlier today, Sentry announced its AI Autofix feature for debugging production code and now, a few hours later, GitHub is launching the first beta of its code-scanning autofix feature for finding and fixing security vulnerabilities during the coding process.
Researchers gave AI an 'inner monologue' and it massively improved its performance.	Scientists trained an AI system to think before speaking with a technique called QuietSTaR. The inner monologue improved common sense reasoning and doubled math performance.
a California city is training AI to spot homeless encampments.	For the last several months, a city at the heart of Silicon Valley has been training artificial intelligence to recognize tents and cars with people living inside in what experts believe is the first experiment of its kind in the United States.
Sora: First Impressions.	A compilation of Sora content generated by visual artists, designers, creative directors, and filmmakers.
Open Interpreter O1 Light.	A portable speech interface that manages your home computer is called the 01 Light. It can utilize your applications, view your screen, and pick up new abilities. The open-source 01 serves as the basis for a new generation of AI gadgets.
Character Voice For Everyone.	Character Voice is a set of capabilities that elevates the Character.AI experience by enabling users to hear Characters conversing with them one-on-one. The company's bigger goal is to create a multimodal interface that will enable more smooth, simple, and interesting interactions. This is the first step toward that goal.
Cerebras Systems Unveils World’s Fastest AI Chip with Whopping 4 Trillion Transistors.	The 24T parameter language models may be trained using Cerebras' new wafer chip. PyTorch is supported natively.
The GPT-4 barrier has finally been broken.	Four weeks ago, GPT-4 remained the undisputed champion: consistently at the top of every key benchmark, but more importantly the clear winner in terms of “vibes”. Today that barrier has finally been smashed. We have four new models, all released to the public in the last four weeks, that are benchmarking near or even above GPT-4.
China puts trust in AI to maintain largest high-speed rail network on Earth.	The railway system is in better condition than when it was first built, according to peer-reviewed paper. Vast amounts of real-time data are processed by an artificial intelligence system in Beijing to identify problems before they arise, the engineers say
Microsoft to hold a special Windows and Surface AI event in May.	Ahead of Build 2024, Microsoft CEO Satya Nadella will share the company’s ‘AI vision’ for both software and hardware.
AI ‘apocalypse’ could take away almost 8m jobs in the UK, says the report.	Women, younger workers and lower paid are at most risk from artificial intelligence, says IPPR thinktank
Elon Musk says all Premium subscribers on X will gain access to AI chatbot Grok this week.	Following Elon Musk’s xAI’s move to open source its Grok large language model earlier in March, the X owner on Tuesday said that the company formerly known as Twitter will soon offer the Grok chatbot to more paying subscribers.
OpenAI’s chatbot store is filling up with spam.	TechCrunch found that the GPT Store, OpenAI’s official marketplace for GPTs, is flooded with bizarre, potentially copyright-infringing GPTs that imply a light touch where it concerns OpenAI’s moderation efforts.
Apple's big WWDC 2024 announcement may be an AI App Store.	Apple's AI strategy may not necessarily be to only offer the best AI apps it can produce, but instead deliver an enhanced AI App Store that may debut at WWDC.
Mathematicians use AI to identify emerging COVID-19 variants.	Scientists at The Universities of Manchester and Oxford have developed an AI framework that can identify and track new and concerning COVID-19 variants and could help with other infections in the future.
iOS 18 Reportedly Won't Feature Apple's Own ChatGPT-Like Chatbot.	Bloomberg's Mark Gurman today reported that Apple is not planning to debut its own generative AI chatbot with its next major software updates, including iOS 18 for the iPhone. Instead, he reiterated that Apple has held discussions with companies such as Google, OpenAI, and Baidu about potential generative AI partnerships.
Introducing DBRX: A New State-of-the-Art Open LLM.	DBRX, an open, general-purpose LLM created by Databricks. Across a range of standard benchmarks, DBRX sets a new state-of-the-art for established open LLMs.
Amazon invests another $2.75B in Anthropic — reportedly ‘largest’ in company history.	Today, Amazon announced it has finalized that investment at the full planned amount, putting in another $2.75 billion atop the $1.25 billion it originally committed last year. According to CNBC, it is Amazon’s “largest venture investment yet.”
OpenAI Is Starting To Test GPT Earning Sharing.	We’re partnering with a small group of US builders to test usage-based GPT earnings. Our goal is to create a vibrant ecosystem where builders are rewarded for their creativity and impact and we look forward to collaborating with builders on the best approach to get there.
Nvidia Tops MLPerf’s Inferencing Tests.	Now that we’re firmly in the age of massive generative AI, it’s time to add two such behemoths, Llama 2 70B and Stable Diffusion XL, to MLPerf’s inferencing tests. Version 4.0 of the benchmark tests more than 8,500 results from 23 submitting organizations. As has been the case from the beginning, computers with Nvidia GPUs came out on top, particularly those with its H200 processor. But AI accelerators from Intel and Qualcomm were in the mix as well.
AI21 releases Jamba Language Model.	The Mamba model style is designed to outperform Transformers in terms of efficiency while maintaining performance parity. One new version with MoE layers is Jamba. With a context length of 128k tokens, it can operate at 1.6k tokens per second. It performs 67% on the benchmark for MMLU. There are weights available.
Hume introduces Empathic Voice Interface.	Meet Hume’s Empathic Voice Interface (EVI), the first conversational AI with emotional intelligence.
Google starts testing AI overviews from SGE in main Google search interface.	Google is now testing AI overviews in the main Google Search results, even if you have not opted into the Google Search Generative Experience labs feature. Google said this is an experience on a “subset of queries, on a small percentage of search traffic in the U.S.,” a Google spokesperson told Search Engine Land.
LLaVA-HR: High-Resolution Large Language-Vision Assistant .	This repository contains the implementation of LLaVA-HR, a strong and efficient MLLM powered by our mixture-of-resolution adaptation.
Meta is adding AI to its Ray-Ban smart glasses next month.	The Ray-Ban Meta Smart Glasses can do things like identify objects, monuments, and animals, as well as translate text.
Google bringing Gemini Nano to Pixel 8 with next Feature Drop.	The Pixel 8 will get Gemini Nano, in developer preview, to power Summarize in Recorder and Gboard Smart Reply. The latter allows for “higher-quality smart replies” that have “conversational awareness” and should be generated faster. On the Pixel 8 Pro, it works with WhatsApp, Line, and KakaoTalk. Meanwhile, Summarize can take a recording and generate bullet points.

Resources

Link	description
Building and testing C extensions for SQLite with ChatGPT Code Interpreter.	This essay goes into great detail on how to create code in a foreign language for a difficult task using ChatGPT (or any other language model). Its creator writes, compiles, and downloads new bindings for the well-known database SQLite using ChatGPT's code interpreter.
Official Mistral Fine-tuning Code.	A hackathon was recently organized by Mistral. The business also published code for optimizing its language models along with version 0.2 of the 7B model. The coding is clear and easy to read.
Scalable Optimal Transport.	A curated list of research works and resources on optimal transport in machine learning.
AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation.	AdaIR presents an all-in-one image restoration network that addresses several types of picture deterioration such as noise, blur, and haze by using frequency mining and modulation.
Turbocharged Training: Optimizing the Databricks Mosaic AI stack with FP8.	The group at Databricks Mosaic has persisted in advancing language model training. They talk about the fp8 training stack and the potential advantages of decreasing precision in this post.
Low-latency Generative AI Model Serving with Ray, NVIDIA Triton Inference Server, and NVIDIA TensorRT-LLM.	A new collaboration between Anyscale and NVIDIA will allow users to scale generative AI models into production. Customers can enhance resource management, observability, and autoscaling by utilizing the combined capabilities of Anyscale's managed runtime environment and Ray through this integration.
Discover The Best AI Websites & Tools.	11006 AIs and 233 categories in the best AI tools directory. AI tools list & GPTs store are updated daily by ChatGPT.
codel.	Fully autonomous AI Agent that can perform complicated tasks and projects using a terminal, browser, and editor.
binary vector search is better than your FP32 vectors.	A crucial component of RAG pipelines is searching over embedding vectors. You may retain performance while reducing memory needs by 30x by substituting a single 0 or 1 for the fp32 numbers, followed by a KNN clustering and reranked.
Deepfake Generation and Detection: A Benchmark and Survey.	This thorough analysis explores the developments and difficulties around deepfake technology and its detection, emphasizing the arms race between those who produce deepfakes and those who are creating systems to identify them.
Evaluate LLMs in real-time with Street Fighter III.	Make LLMs fight each other in real-time in Street Fighter III. Each player is controlled by an LLM. We send to the LLM a text description of the screen. The LLM decides on the next moves its character will make. The next moves depend on its previous moves, the moves of its opponents, its power, and health bars.
Superpipe.	Superipe is a lightweight framework to build, evaluate and optimize LLM pipelines for structured outputs: data labeling, extraction, classification, and tagging. Evaluate pipelines on your own data and optimize models, prompts, and other parameters for the best accuracy, cost, and speed.

Perspectives

Link	description
How People Are Really Using GenAI.	There are many use cases for generative AI, spanning a vast number of areas of domestic and work life. Looking through thousands of comments on sites such as Reddit and Quora, the author’s team found that the use of this technology is as wide-ranging as the problems we encounter in our lives. The 100 categories they identified can be divided into six top-level themes, which give an immediate sense of what generative AI is being used for: Technical Assistance & Troubleshooting (23%), Content Creation & Editing (22%), Personal & Professional Support (17%), Learning & Education (15%), Creativity & Recreation (13%), Research, Analysis & Decision Making (10%).
Untangling concerns about consolidation in AI.	Microsoft's recent acquisition of Inflection's talent sparked discussions about the largest tech giants having too much influence over AI research and development. Although they have the resources to work quickly on basic language models, there are legitimate concerns that the concentration of power would stifle transparency and innovation. This article examines the intricate trade-offs that arise as artificial intelligence becomes more widely used.
‘A landmark moment’: scientists use AI to design antibodies from scratch.	Modified protein-design tool could make it easier to tackle challenging drug targets — but AI antibodies are still a long way from reaching the clinic.
TechScape: Is the US calling time on Apple’s smartphone domination?	The tech giant fights regulators on both sides of the Atlantic, as the US government launches a grab-bag of accusations. Plus, Elon Musk’s bad day in court
Go, Python, Rust, and production AI applications.	The roles of Python, Go, and Rust in developing AI applications are covered in this article: Go is used for larger-scale production, Python is used for developing AI models, and Rust is used for tasks requiring high performance. It highlights the significance of choosing the appropriate language for the task based on the ecosystem and tool fit, speculating that Go may replace Python as the production language. The author promotes connecting the Go and Python communities to improve the development of AI applications.
Trends in Synthetic Biology & AI in Drug Discovery in 2024.	2024 promises to be a historic year for artificial intelligence in drug discovery, with significant progress being made in synthetic biology. The synthesis of modular biological components and the impact of generative AI on research are two prominent themes that are highlighted in this article. The entry of Insilico Medicine's AI-powered candidate into Phase II clinical trials demonstrates how the combination of artificial intelligence and synthetic biology is speeding up the drug discovery process.
LLMs have special intelligence, not general, and that's plenty.	In sophisticated cognitive tests, Anthropic's new AI model Claude-3 performs better than other models, including GPT-4, and above the average human IQ. Even with this success, Claude-3 still finds it difficult to solve simple puzzles and other basic tasks that people take for granted. Rather than having general intelligence like that of humans, LLMs can have a "Special Intelligence." They can be creatively reflecting back to us what they know.
AI SaaS Companies Will Be More Profitable.	The deflationary impacts of AI in marketing, sales, operations, and software development could mean that while AI software companies may initially incur higher costs, they could end up being more profitable than traditional SaaS companies.
AI image generators often give racist and sexist results: can they be fixed?	Researchers are tracing sources of racial and gender bias in images generated by artificial intelligence, and making efforts to fix them.
How AI is improving climate forecasts.	Researchers are using various machine-learning strategies to speed up climate modelling, reduce its energy costs and hopefully improve accuracy.
Here’s why AI search engines really can’t kill Google.	The AI search tools are getting better — but they don’t yet understand what a search engine really is and how we really use them.
Inside the shadowy global battle to tame the world's most dangerous technology.	The problem of controlling AI is one that the world is now facing. Global leaders, tech executives, and legislators convened many high-profile meetings and conferences that exposed disagreements and differences over how to regulate this game-changing technology.
Hackers can read private AI-assistant chats even though they’re encrypted.	All non-Google chat GPTs affected by side channel that leaks responses sent to users.
Towards 1-bit Machine Learning Models.	Recent works on extreme low-bit quantization such as BitNet and 1.58 bit have attracted a lot of attention in the machine learning community. The main idea is that matrix multiplication with quantized weights can be implemented without multiplications, which can potentially be a game-changer in terms of compute efficiency of large machine learning models.
AI escape velocity.	The law of accelerating returns, which holds that progress is made at an exponential pace over time, was created by AI futurist Ray Kurzweil. Kurzweil covered a wide range of subjects in a recent talk, such as prospects that are only going to get better, the future of the AI economy, human relationships with AIs, lifespan escape velocity, and much more.
Plentiful, high-paying jobs in the age of AI.	Experts in AI are investigating automating human functions, raising fears about job losses and declining wages. The belief that advances in AI would eventually render human labor obsolete, however, may not be accurate. Constraints like computer power and opportunity costs may mean that humans will still have jobs in an AI-dominated future, but this is not a given.

Back to index

ML news: Week 18 - 24 March

Research

Link	description
ScoreHMR: Score-Guided Diffusion for 3D Human Recovery.	We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for solving inverse problems for 3D human pose and shape reconstruction. ScoreHMR mimics model fitting approaches, but alignment with the image observation is achieved through score guidance in the latent space of a diffusion model. Here, we show the application of our approach on videos, utilizing keypoint detections and score guidance with keypoint reprojection and temporal smoothness terms.
Cappy: Outperforming and boosting large multi-task language models with a small scorer.	A little model called Cappy has been taught to accept instructions and a candidate's completion, then calculate how well the completion satisfies the instructions by returning a score. It performs better on this job than significantly bigger models, indicating that it may be applied as a generation and training feedback mechanism.
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation.	demonstrates how LLM reasoning and generation in long-horizon generation tasks can be greatly enhanced by iteratively revising a chain of thoughts with information retrieval; the key idea is that each thought step is revised with pertinent retrieved information to the task query, the current and past thought steps; Retrieval Augmented Thoughts (RAT) is a zero-shot prompting approach that offers notable improvements over baselines that include vanilla RAG, zero-shot CoT prompting, and other baselines. RAT can be applied to various models such as GPT-4 and CodeLlama-7B to improve long-horizon generation tasks (e.g., creative writing and embodied task planning).
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking.	outlines Quiet-STaR, a generalization of STaR that enables language models (LMs) to acquire reasoning skills that are more scalable and general; Quiet-STaR gives LMs the ability to produce justifications for each token to explain the future text; it suggests a token-wise parallel sampling approach that enhances LM predictions by producing internal thoughts effectively; REINFORCE is used to improve the rationale creation.
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM.	suggests combining expert LLMs with a Mixture-of-Experts LLM as a more computationally efficient way to train LLMs. This method, called BTX, is shown to be more effective than training a single specialized LLM or a larger generalist LLM. It works by first training (in parallel) multiple copies of a seed LLM with specialized knowledge in different domains (i.e., expert LLMs), then combining them into a single LLM using MoE feed-forward layers. Finally, the entire unified model is fine-tuned.
Large language models surpass human experts in predicting neuroscience results.	suggests using BrainBench as a benchmark to assess LLMs' capacity to forecast neuroscience outcomes; discovers that LLMs outperform experts in forecasting the results of experiments; an LLM that has been modified based on neuroscience literature has been demonstrated to do even better.
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer.	Comprehensive literature analysis faces a problem due to the scientific literature's constant increase. Because of their ability to summarize, LLMs present a viable option; yet, they are not well-suited to the multimodal aspects that are common in scientific information. Uni-SMART (Universal Science Multimodal Analysis and Research Transformer) was created to fill this vacuum by understanding and analyzing the intricate multimodal data found in scientific publications.
Mechanics of Next Token Prediction with Self-Attention.	Predicting the next token is a straightforward goal that triggers complex actions. This work discovered that the problem could be divided into two parts: soft composition and hard retrieval. This allowed for good overall performance and in-context learning, and the single self-attention layer was trained using gradient descent.
Knowledge Distillation in YOLOX-ViT for Side-Scan Sonar Object Detection.	By combining visual transformers with knowledge distillation, YOLOX-ViT presents a novel method for object recognition in underwater robots.
GroupContrast.	GroupContrast combines semantic-aware contrastive learning with segment grouping to redefine self-supervised 3D representation learning.
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification.	With an emphasis on object-centric information, this study presents a novel approach to object detection in photos captured from a variety of spectrums, including RGB, near-infrared, and thermal imaging. The goal is to increase recognition accuracy by mitigating the effects of background noise.
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation.	Stable Diffusion 3 is a potent model for creating images. Latent Adversarial Diffusion Distillation, which keeps picture production quality constant while reducing the number of diffusion stages to 4, is shown in this study.
Distilling Datasets Into Less Than One Image.	Poster Dataset Distillation (PoDD): We propose PoDD, a new dataset distillation setting for a tiny, under 1 image-per-class (IPC) budget. In this example, the standard method attains an accuracy of 35.5% on CIFAR-100 with approximately 100k pixels, and PoDD achieves an accuracy of 35.7% with less than half the pixels (roughly 40k)
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control .	MineDreamer is an artificial intelligence (AI) bot that uses cutting-edge language and vision models creatively to obey intricate commands in the Minecraft universe.
DreamDA: Generative Data Augmentation with Diffusion Models.	DreamDA presents a novel method of data augmentation by creating high-quality, diversified synthetic visuals that closely resemble the original data distribution using diffusion models.
Chain-of-Spot: Interactive Reasoning Improves Large Vision-language Models.	The Interactive Reasoning method known as Chain-of-Spot (CoS) greatly improves the way Large Vision-Language Models (LVLMs) analyze and comprehend pictures. With CoS, LVLMs may obtain precise visual information without sacrificing picture quality by concentrating on specific regions of interest inside images in response to predetermined inquiries or commands.
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On.	A new method for image-based virtual try-on is called StableVITON. This approach takes advantage of the creative capacity of diffusion models that have already been trained while paying attention to garment details. StableVITON discovers semantic correspondences in the latent space of a pre-trained model between clothing and the human body.
Diffusion-based Video Translation.	FRESCO is a unique method that greatly enhances the spatial-temporal consistency in video translation tasks by combining intra-frame and inter-frame correspondences.
Generalized Consistency Trajectory Models.	With the introduction of Generalized Consistency Trajectory Models (GCTMs), this effort improves the capabilities of diffusion models for tasks such as image restoration and editing. By translating between any two distributions in a single step, these models simplify the procedure and enable remarkably accurate and efficient image modification.
Introducing SceneScript, a novel approach for 3D scene reconstruction.	A model developed by Meta Reality Labs may convert visual input into a three-dimensional (3D) representation of a scene. The 70m parameter model has exceptional stability and operates rapidly on the device.
Scalable Diffusion Models with State Space Backbone.	A novel kind of diffusion model known as Diffusion State Space Models (DiS) uses a state space backbone for image data instead of the conventional U-Net. These models are effective at producing high-quality photos with little computing work and can manage long-range relationships.
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns.	PuzzleVQA is a dataset created to evaluate big multimodal models such as GPT-4V's capacity for abstract thinking.

News

Link	description
Open Release of Grok-1.	We are releasing the weights and architecture of our 314 billion parameter Mixture-of-Experts model, Grok-1.
Did OpenAI just accidentally leak the next big ChatGPT upgrade?	OpenAI may have accidentally leaked details about a new AI model called GPT-4.5 Turbo. The leak suggests that GPT-4.5 Turbo will be faster, more accurate, and have a larger knowledge base than its predecessor.
Claude 3 Haiku: our fastest model yet.	Today we’re releasing Claude 3 Haiku, the fastest and most affordable model in its intelligence class. With state-of-the-art vision capabilities and strong performance on industry benchmarks
Midjourney debuts feature for generating consistent characters across multiple gen AI images.	The popular AI image generating service Midjourney has deployed one of its most oft-requested features: the ability to recreate characters consistently across new images.
Apple researchers achieve breakthroughs in multimodal AI as company ramps up investments.	Apple researchers have developed new methods for training large language models on both text and images, enabling more powerful and flexible AI systems, in what could be a significant advance for artificial intelligence and for future Apple products.
Introducing Stable Video 3D: Quality Novel View Synthesis and 3D Generation from Single Images.	Today we are releasing Stable Video 3D (SV3D), a generative model based on Stable Video Diffusion, advancing the field of 3D technology and delivering greatly improved quality and view-consistency.
Google researchers unveil ‘VLOGGER’, an AI that can bring still photos to life.	Google researchers have developed a new artificial intelligence system that can generate lifelike videos of people speaking, gesturing and moving — from just a single still photo. The technology, called VLOGGER, relies on advanced machine learning models to synthesize startlingly realistic footage, opening up a range of potential applications while also raising concerns about deepfakes and misinformation.
Microsoft has added the GPT-4 Turbo LLM to the free version of Copilot.	Microsoft is boosting the performance of its Copilot generative AI chatbot today. It has been confirmed that all free Copilot users can now access the GPT-4 Turbo large language model from OpenAI.
Korean researchers power-shame Nvidia with new neural AI chip — claim 625 times less power draw, 41 times smaller.	The new C-Transformer chip is claimed to be the world's first ultra-low power AI accelerator chip capable of large language model (LLM) processing.
Inflection co-founders leave for Microsoft AI.	Karén Simonyan and Mustafa Suleyman are leaving Inflection to launch Microsoft AI. The next CEO will be Sean White. Additionally, a few Inflection senior team members are joining Microsoft AI.
Lilac acquired by Databricks.	Lilac is a scalable, user-friendly tool for data scientists to search, cluster, and analyze any kind of text dataset with a focus on generative AI.
IBM and NASA build language models to make scientific knowledge more accessible.	In a new collaboration, IBM and NASA created a suite of efficient language models by training on scientific literature. Based on the transformer architecture, these models can be used in a variety of applications, from classification and entity extraction to question-answering and information retrieval. These models achieve high performance across a variety of domains and can respond promptly. We have open-sourced the models on Hugging Face for the benefit of the scientific and academic community.
Introducing RAG 2.0.	A technique for adding knowledge to a language model that can become stale is called retrieval augmented generation, or RAG. Unfortunately, outside of demonstrations, the current paradigm of "frozen RAG," in which just a portion of the pipeline is trained and the model itself is not updated, performs badly. This blog describes the next generation of RAG, where all the components are fine-tuned for the job at hand. In this system, an open model such as Mistral 7B can perform better than the conventional GPT-4 RAG.
Fitbit Using Google Gemini for New AI That Could Become Your Fitness Coach.	Google is training Gemini on health data, and it's creating a new AI model for the Fitbit app that can give advice tailored to your needs.
Stable Diffusion maker leaves Stability AI.	Robin Rombach helped build the tech that made Stability AI famous, now he's leaving the company
Introducing Copilot4D: A Foundation Model for Self-Driving.	Waabi's Copilot4D is a ground-breaking foundation model that advances the capabilities of autonomous machines by using LiDAR data to comprehend and forecast the 3D dynamics of the environment across time.
NLX Raises $15M in Series A Funding.	In March 2024, NLX extended its Series A funding to $15M, adding Comcast Ventures.
Triton Puzzles.	Triton is an alternative open-source language that allows you to code at a higher level and compile to accelerators like GPU. This set is puzzles is meant to teach you how to use Triton from first principles in an interactive fashion. You will start with trivial examples and build your way up to real algorithms like Flash Attention and Quantized neural networks. These puzzles do not need to run on GPU since they use a Triton interpreter.
New Breakthrough Brings Matrix Multiplication Closer to Ideal.	Researchers from Tsinghua University and UC Berkeley have made great strides in matrix multiplication, introducing a novel method that has already inspired improvements. Significant time, power, and cost savings in a variety of applications could result from this development in a fundamental computer procedure. Since the previous milestone in 2010, this is the most significant advancement in lowering the computational cost of matrix multiplication.
OpenAI could release GPT-5 in a few months: Report.	OpenAI could release GPT-5, the next-generation of its groundbreaking large language model, in a few months, according to a new report.
Beijing court’s ruling that AI-generated content can be covered by copyright eschews US stand, with far-reaching implications on tech’s use.	The Beijing Internet Court ruled that an AI-generated image in an intellectual property dispute was an artwork protected by copyright laws. That decision is expected to have far-reaching implications for future AI copyright disputes, which could eventually benefit Chinese Big Tech companies.
Japan’s premier AI lab launches its first model.	Sakana AI develops cutting-edge models for Japanese language, vision, and picture production. In order to evolve foundation models without the need for costly retraining, it introduced an evolutionary model merging. The model merging and a description of the process are now available.
Cohere’s Command-R Enterprise Model Coming to ai.nvidia.com.	The RAG-optimized Command-R model from Cohere, which is intended to help enterprises transition to large-scale production, will soon be available in the freshly released NVIDIA API catalog.
Biden-Harris Administration Announces Deal with Intel for AI Chips.	Biden-Harris Administration Announces Preliminary Terms with Intel to Support Investment in U.S. Semiconductor Technology Leadership and Create Tens of Thousands of Jobs
Apple’s AI ambitions could include Google or OpenAI.	The iPhone-maker is in ‘active’ talks to bring Gemini to the iPhone and has also considered using ChatGPT.
World’s first major act to regulate AI passed by European lawmakers.	The European Union’s parliament on Wednesday approved the world’s first major set of regulatory ground rules to govern the mediatized artificial intelligence at the forefront of tech investment. Born in 2021, the EU AI Act divides the technology into categories of risk, ranging from “unacceptable” — which would see the technology banned — to high, medium and low hazard.

Resources

Link	description
tlm - Local CLI Copilot, powered by CodeLLaMa.	tlm is your CLI companion which requires nothing except your workstation. It uses the most efficient and powerful CodeLLaMa in your local environment to provide you with the best possible command line suggestions.
Multi-node LLM Training on AMD GPUs.	The whole stack of technologies, including schedulers, model training software, and more, that Lamini employs to train models on AMD GPUs is described in this blog article.
clarity-upscaler.	A state-of-the-art image upscaling tool.
musiclang_predict.	Music Lang is an API and set of models that generate music.
Optimizing Technical Docs for LLMs.	Capa.ai provides guidance on how to organize LLM documentation, including how to include troubleshooting FAQs, self-contained code snippets, segmentation into sub-products, and community forum creation.
lamini/earnings-calls-qa.	This dataset contains transcripts of earning calls for various companies, along with questions and answers related to the companies' financial performance and other relevant topics.
Knowledge Conflicts for LLMs: A Survey.	a summary of the prevalent problem of knowledge conflict that arises while working with LLMs; the survey article divides these conflicts into three categories: intra-memory, inter-context, and context-memory conflict. It also offers insights into the sources of these conflicts and possible solutions.
Enhancing RAG-based application accuracy by constructing and leveraging knowledge graphs.	A practical guide to constructing and retrieving information from knowledge graphs in RAG applications with Neo4j and LangChain
How to Evaluate Your RAG System?	Retrieval Augmented Generation (RAG) is a powerful technique that enhances output quality by retrieving relevant context from an external vector database. However, building and evaluating an RAG system can be challenging, especially when it comes to measuring performance. In this post, we'll explore the most effective metrics for each stage of your RAG pipeline and how to use them to evaluate your whole system.
Anthropic Prompt Library.	Although Claude 3 has been widely used, these models use a somewhat different prompting technique. Anthropic has compiled a list of user prompts that are effective for a wide range of assignments and subjects.
Pretraining 16 language models on different tokenizers.	One peculiarity of contemporary language modeling is that the model is not trained until the tokenizer has been trained. The second peculiar truth is that, on vast scales, vocabulary size doesn't appear to matter all that much.
LLM4Decompile.	Reverse Engineering: Decompiling Binary Code with Large Language Models
Under The Hood: How OpenAI's Sora Model Works.	In this blog post, we dive into some of the technical details behind Sora. We also talk about our current thinking around the implications of these video models. Finally, we discuss our thoughts around the compute used for training models like Sora and present projections for how that training compute compares to inference, which has meaningful indications for estimated future GPU demand.
Quiet-STaR.	A reasoning framework called Quiet-Star enhances language models' capacity to produce accurate results. An eight-step model per token has been given along with the code.
MoE-Adapters4CL.	Continual learning can empower vision-language models to continuously acquire new knowledge, without the need for access to the entire historical dataset. Through extensive experiments across various settings, our proposed method consistently outperforms previous state-of-the-art approaches while concurrently reducing parameter training burdens by 60%.
LlamaGym.	Fine-tune LLM agents with online reinforcement learning
Stylized image binning algorithm.	This is a tutorial on utilizing a JavaScript binning method to create an image processing application that looks like pixel art and has customizable interactive web features like sliders. By averaging pixel brightness inside bins, the binning technique transforms photos into stylized, pixelated artwork by utilizing parameters like bin size and spacing. The approach entails efficiently optimizing looping structures and modifying pixel data on HTML canvas components.
TorchTune.	TorchTune is a native-Pytorch library for easily authoring, fine-tuning and experimenting with LLMs.
MVFA-AD.	Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images

Perspectives

Link	description
What I learned from looking at 900 most popular open source AI tools.	By examining the GitHub stars of well-known AI models, we can uncover some fascinating patterns. The majority of open-source AI tools appear to be geared at apps and infrastructure.
LLM inference speed of light.	This article explores the "speed of light" theoretical limit for transformer-based language model inference and emphasizes the significance of memory bandwidth over computational power, showing that the ability to read data from memory rather than perform calculations is the primary constraint on inference speed and that this is an important factor to optimize and comprehend the performance of AI.
AI is bad/good actually.	This article's author suggests eschewing the nebulous good/bad continuum and instead use terminology like "harmful," "helpful," "capable," and "incapable" to distinguish AI conversations. For them, AI is capable yet possibly dangerous because of unresolved problems like bias exaggeration and copyright infringement. Using these more precise phrases, the author asks readers to explain their own opinions on AI
Captain's log: the irreducible weirdness of prompting AIs.	A wealth of free AI and machine learning tools can be found on the new companion website, More Useful Things. These resources highlight the amusing and useful ways in which AI-generated prompts, such as creative scenarios, can surpass human-crafted ones in tasks like solving mathematical puzzles. For more consistent prompting outcomes, the experiment emphasizes the value of adding context, few-shot learning, and chain-of-thought strategies. Though organized prompting is still an evolving art with considerable potential benefits, prompting as a talent may become less important as AI models advance and get better at inferring user intent.
AI Prompt Engineering Is Dead, Long live AI prompt engineering.	According to recent studies, as AI and machine learning models get better at optimizing their own prompts, human prompt engineers might become outdated. Prompts produced by algorithms can be strange but powerful; they exceed those created by humans and significantly cut down on optimization time. Despite the potential of automatically adjusted prompts, experts predict that the need for occupations related to prompts will change rather than vanish, maybe taking the form of new positions like LLMOps (Large Language Model Operations).
The Road to Biology 2.0 Will Pass Through Black-Box Data.	This year marks perhaps the zenith of expectations for AI-based breakthroughs in biology, transforming it into an engineering discipline that is programmable, predictable, and replicable. Drawing insights from AI breakthroughs in perception, natural language, and protein structure prediction, we endeavor to pinpoint the characteristics of biological problems that are most conducive to being solved by AI techniques. Subsequently, we delineate three conceptual generations of bio AI approaches in the biotech industry and contend that the most significant future breakthrough will arise from the transition away from traditional “white-box” data, understandable by humans, to novel high-throughput, low-cost AI-specific “black-box” data modalities developed in tandem with appropriate computational methods.
"AI, no ads please": 4 words to wipe out $1tn.	AI poses a huge threat to ad-based platforms by slashing how many ads we see
OpenAI’s “Own Goal”.	And why it is becoming increasingly difficult to take them at their word
What if it isn't happening, AGI is not coming?	No matter what appears to be happening, we always have to consider what if it isn't. What If LLMs fail to turn into AGIs? Has our quest for intelligence simply unveiled our demonstrable lack thereof? Will trillions of dollars turn unpredictable hallucination machines into reliable universal productivity tools that can do anything?
How OpenAI’s text-to-video tool Sora could change science – and society.	OpenAI’s debut of its impressive Sora text-to-video tool has raised important questions.
Chatbot AI makes racist judgements on the basis of dialect.	Some large language models harbor hidden biases that cannot be removed using standard methods.
Could AI-designed proteins be weaponized? Scientists lay out safety guidelines.	AI tools that can come up with protein structures at the push of a button should be used safely and ethically, say researchers in the field.
Three reasons why AI doesn’t model human language.	Artificial intelligence (AI) is being used to develop large language models (LLMs) with considerable success. But they should not be seen as being models of how human language works and is acquired.
So … you’ve been hacked.	Research institutions are under siege from cybercriminals and other digital assailants. How do you make sure you don’t let them in?
8 Google Employees Invented Modern AI. Here’s the Inside Story.	They met by chance, got hooked on an idea, and wrote the “Transformers” paper—the most consequential tech breakthrough in recent history.
Using LLMs to Generate Fuzz Generators.	Claude and other LLMs are capable of producing efficient fuzzes for code parsing, automating a task that has historically required a great deal of human labor. Given that fuzzing is stochastic, LLMs seem to be a good fit for producing fuzzes, even if they are usually not exact enough for static analysis. To find and exploit code vulnerabilities, a hybrid approach that combines targeted fuzzing and LLM-driven static analysis may be promising.
First Impressions of Early-Access GPT-4 Fine-Tuning.	A few weeks ago we finally got access to the GPT-4 fine-tuning API (in limited early access), and were super excited to check out how well it works. We’d been a user of OpenAI’s fine-tuned models since fine-tuning the original GPT-3 Davinci model first became available.
AI and the Future of Work.	High Mensa exam scores for Anthropic's most recent AI, Claude, indicate that self-improving AI is not far off and presents both prospects and existential concerns. As seen at Klarna, where a customer support AI replaced 700 workers, machine learning is already eliminating jobs. This suggests that automation is becoming more and more common. Recent layoffs at Duolingo as a result of AI's translation capabilities highlight this change and the increasing influence of AI on the nature of work in the future.
Two years later, deep learning is still faced with the same fundamental challenges.	Gary Marcus revisits his forecasts two years after writing a pessimistic AI paper, and he maintains his original mistrust. Even with breakthroughs like GPT-4, basic problems like true understanding and reliable AI are still unsolved. Marcus draws the conclusion that multidisciplinary cooperation is essential to achieving AGI and that increasing data and processing capacity alone won't be enough.
From 0 to 10 million users in four years.	In just four years, the AI-powered writing tool Copy.ai has amassed an amazing 10 million users.

Back to index

ML news: Week 11 - 17 March

Research

Link	description
Yi: Open Foundation Models by 01.AI.	One of the most potent open language models for a long time has been the Yi model. The group has published a document that offers significant new information about how they gather data and train employees.
From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models.	This research uses translation to enhance safety measures in situations when direct data is not available, so taking on the task of minimizing dangerous material in AI across many languages.
Plum: Prompt Learning using Metaheuristic.	In this research, a broad class of more than 100 discrete optimization techniques known as metaheuristics is presented as a potent tool for enhancing rapid learning in big language models.
ViewFusion: Towards Multi-View Consistency via Interpolated Denoising.	A new technique called ViewFusion aims to enhance the way diffusion models produce images from fresh angles while maintaining the consistency of the images from one view to another.
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap.	reveals that there is a reasoning gap between the current models and the proposed functional benchmarks for evaluating the reasoning abilities of LLMs, ranging from 58.35% to 80.31%. However, the authors also note that these gaps can be closed with more advanced prompting techniques.
Can Large Language Models Reason and Plan?	The subject of thinking and planning for LLMs is covered in a recent position paper. The following is an overview of the author's findings: In summary, I don't have any strong evidence from anything I've read, checked, or done to suggest that LLMs engage in typical reasoning or planning. Instead, they use web-scale training to perform a type of universal approximate retrieval, which is sometimes confused for reasoning abilities, as I have explained."
KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents.	we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents.
Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation.	The new Stealing Stable Diffusion (SSD) method improves monocular depth estimate performance in challenging settings such as low light or wet ones.
VideoElevator : Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models.	Using the advantages of text-to-image models, VideoElevator presents a unique method that improves text-to-video diffusion models. Videos with better frame quality and text alignment are produced by dividing the improvement process into two parts: fine-tuning temporal motion and improving spatial quality. This is known as the plug-and-play approach.
Face2Diffusion for Fast and Editable Face Personalization.	Gaussian Splatting is combined with 3D mesh geometry in SplattingAvatar to create vibrant virtual humans, introducing a novel method for producing lifelike virtual humans.
Stealing Part of a Production Language Model.	By leveraging their public APIs, you may obtain parts of closed language models—like the embeddings layer—for free. A simple budget of less than $2,000 may do this.
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling.	DNA sequence prediction model developed on the Transformer rival Mamba platform. For a little model, it is incredibly powerful and efficient.
V3D: Video Diffusion Models are Effective 3D Generators.	In order to improve 3D object production, this research presents a revolutionary method that creates detailed, high-quality objects from a single photograph.
A generalist AI agent for 3D virtual environments.	We present new research on a Scalable Instructable Multiworld Agent (SIMA) that can follow natural-language instructions to carry out tasks in a variety of video game settings
SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces.	By concentrating on linear memory consumption, this study overcomes the memory limitations of conventional attention-based diffusion models and presents a novel method for producing videos using state-space models (SSMs). As tested with the UCF101 and MineRL Navigate datasets, SSMs allow the generation of lengthier video sequences with competitive quality.
SemCity: Semantic Scene Generation with Triplane Diffusion.	SemCity transforms 3D scene production by emphasizing real-world outdoor environments—a problem that is sometimes disregarded because of how difficult and sparse outdoor data may be.
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM.	This study demonstrates how to train several models and combine them into a single Mixture-of-Experts model.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code.	It is difficult to evaluate language models that have been taught to code. The majority of people utilize OpenAI's HumanEval. Some open models, nevertheless, appear to overfit this standard. Coding performance may be measured while reducing contamination issues with LiveCodeBench.
Evil Geniuses: Delving into the Safety of LLM-based Agents.	'Evil Geniuses' is a virtual squad that researchers utilized in a recent study to examine the safety of LLMs. They discovered that these AI agents are less resistant to malevolent attacks, give more nuanced answers, and make it more difficult to identify improper responses.
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions.	In this work, a novel backbone architecture called ViT-CoMer is presented, which improves on Vision Transformers (ViT) for dense prediction tasks without requiring pre-training.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.	Apple just released a multimodal model and discussed how they trained in detail.

News

Link	description
OpenAI announces new members to board of directors.	Dr. Sue Desmond-Hellmann, Nicole Seligman, Fidji Simo join; Sam Altman rejoins board
So long and thanks for all the pixels: Nvidia reportedly retiring the GTX brand for good.	Nvidia has stopped producing GPUs based on its Turing architecture. The last of them included the likes of the GTX 1660, 1650, and 1630 series of GPUs. Once remaining stocks sell, they'll be gone and with them, the "GTX" brand itself, leaving all Nvidia gaming graphics cards as "RTX" models.
Google’s upcoming Tensor G4 Chip set to rival Snapdragon 8 Gen 4 and Apple A18 Pro.	Let’s say you’re a smartphone manufacturer aiming to develop a new model. You have two options: partner with an established chipmaker like Qualcomm or MediaTek or follow the path of Apple by designing your own custom chipset. Google has taken a similar approach, developing its in-house Tensor processors. Recent information suggests the Pixel 9 will feature the Tensor G4 chipset, promising improved heat and power management for an enhanced user experience.
Microsoft may debut its first 'AI PCs' later this month.	A report suggests an OLED Surface Pro 10 and Surface Laptop 6 are imminent.
Looks like we may now know which OpenAI execs flagged concerns about Sam Altman before his ouster.	Two OpenAI execs raised concerns about Sam Altman before his ouster, The New York Times reported. The outlet reported that the company's chief technology officer, Mira Murati, played a key role. Altman returned as CEO in days, leaving many unanswered questions about what happened.
Cloudflare announces Firewall for AI.	Today, Cloudflare is announcing the development of a Firewall for AI, a protection layer that can be deployed in front of Large Language Models (LLMs) to identify abuses before they reach the models.
Google announces they are tackling spammy, low-quality content on Search.	We’re making algorithmic enhancements to our core ranking systems to ensure we surface the most helpful information on the web and reduce unoriginal content in search results. We’re updating our spam policies to keep the lowest-quality content out of Search, like expired websites repurposed as spam repositories by new owners and obituary spam.
This week, xAI will open-source Grok.	Official tweet of Elon Musk
Covariant is building ChatGPT for robots.	The UC Berkeley spinout says its new AI platform can help robots think more like people. Covariant this week announced the launch of RFM-1 (Robotics Foundation Model 1).
AI solves huge problem holding back fusion power.	Princeton researchers have trained an AI to predict and prevent a common problem arising during nuclear fusion reactions — and they think it might be able to solve other problems, too.
Midjourney bans all Stability AI employees over alleged data scraping.	Midjourney blamed a near 24-hour service outage on ‘botnet-like activity’ from two accounts linked to the Stable Diffusion creator.
Microsoft compares The New York Times’ claims against OpenAI to Hollywood’s early fight against VCR.	Microsoft is helping OpenAI fight back against claims of copyright infringement by The New York Times. The news outlet’s lawsuit, filed in December, seeks to hold Microsoft and OpenAI accountable for billions of dollars in damages. In a court filing on Monday, Microsoft accuses the publisher of “unsubstantiated” claims that the use of OpenAI’s technology is harming its business.
Introducing Devin, the first AI software engineer.	Devin, a new system from Cognition, receives a 14% on the difficult SWE-Bench benchmark, which evaluates AI's capacity for writing code. GPT-4 received a 1.7% score. This model demonstrates excellent contextual learning skills.
Building Meta’s GenAI Infrastructure.	The Llama 3 training infrastructure is described in this Meta blog article. It covers networking, storage, Pytorch, NCCL, and many enhancements. This will prepare the way for Meta's H100s to go online throughout the course of the remaining months of this year.
Physical Intelligence Raises $70M to Build AI-Powered Robots for Any Application.	Pi differentiates itself by aiming to create software that can be applied across a wide range of robotics hardware.
Researchers create AI worms that can spread from one system to another.	Worms could potentially steal data and deploy malware. Now, in a demonstration of the risks of connected, autonomous AI ecosystems, a group of researchers has created one of what they claim is the first generative AI worms—which can spread from one system to another, potentially stealing data or deploying malware in the process.
Perplexity brings Yelp data to its chatbot.	Perplexity’s responses can source multiple Yelp reviews for that cafe you were considering, along with location data and other information.
Gemini now lets you tune and modify responses with a prompt.	Google is launching “a more precise way for you to tune Gemini’s responses” on the web app. When selecting (by highlighting) a part of Gemini’s response to your prompt, a pencil/sparkle icon appears to “Modify selected text.” This opens a box with Regenerate, Shorter, Longer, and Remove options, as well as an open text field.
Microsoft’s neural voice tool for people with speech disabilities arrives later this year.	At the Microsoft Ability summit today, the company is continuing to raise awareness about inclusive design.
Together AI $106M round of funding.	we’ve raised $106M in a new round of financing led by Salesforce Ventures with participation from Coatue, and existing investors.
Autonomous Vehicle Startup Applied Intuition Hits $6B Valuation After $250M Series E.	Autonomous vehicle software developer Applied Intuition locked up a $250 million Series E valuing the company at a $6 billion — a 67% uptick in value from its previous round. The deal comes even as venture funding for autonomous vehicle-related startups has been in decline in recent years.
OpenAI CTO Says It’s Releasing Sora This Year.	But now, OpenAI chief technology officer Mira Murati told the Wall Street Journal that the company will publicly release Sora "later this year."
Google now wants to limit the AI-powered search spam it helped create.	Ranking update targets sites "created for search engines instead of people."
OpenAI Partners With Le Monde And Prisa Media.	We have partnered with international news organizations Le Monde and Prisa Media to bring French and Spanish news content to ChatGPT.
World’s first major act to regulate AI passed by European lawmakers.	The European Union’s parliament on Wednesday approved the world’s first major set of regulatory ground rules to govern the mediatized artificial intelligence at the forefront of tech investment.
Figure 01 can now have full conversations with people.	Figure's robots can now hold in-depth discussions with humans thanks to the integration of OpenAI's technology. While Figure's neural networks provide quick, low-level dexterous robot operations, OpenAI's models offer high-level visual and linguistic intelligence. This X article includes a video of a human conversing with a Figure robot, teaching it how to complete tasks, explaining the rationale behind the tasks, and providing a self-evaluation of the activities' effectiveness.
Claude 3 Is The Most Human AI Yet.	Claude 3, Anthropic's latest AI model, is distinguished by its "warmth," which makes it a reliable collaborator on creative writing assignments. More human-feeling and lifelike, Claude 3 is said to straddle the line between delightful deep contemplation and good thought. Though this subtlety has not been fully captured by technological benchmarks, Claude 3 is set to transform our relationship with AI in creative processes.
From Wait Times to Real-Time: Assort Health Secures $3.5 Million to Scale First Generative AI for Healthcare Call Centers.	Solution Erases Long Phone Holds for Patients, Supports Overwhelmed Medical Front Desk Workers and Improves Patient Access to Physicians

Resources

Link	description
DeepSpeed-FP6: The Power of FP6-Centric Serving for Large Language Models.	A recent upgrade to Microsoft's robust DeepSpeed training package lets models use up to six bits per parameter. This can expedite inference by a factor of more than two.
You can now train a 70b language model at home.	a fully open source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs (RTX 3090 or 4090). This system, which combines FSDP and QLoRA, is the result of a collaboration between Answer.AI, Tim Dettmers (U Washington), and Hugging Face’s Titus von Koeller and Sourab Mangrulkar.
Retrieval-Augmented Generation for AI-Generated Content: A Survey.	gives a summary of RAG's application in several generating contexts, such as code, images, and audio, and includes a taxonomy of RAG upgrades along with citations to important works.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models.	Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models.
SaulLM-7B: A pioneering Large Language Model for Law.	With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens.
A Practical Guide to RAG Pipeline Evaluation (Part 1: Retrieval).	Retrieval is a critical and complex subsystem of the RAG pipelines. After all, the LLM output is only as good as the information you provide it unless your App relies solely on the training data of the LLM. The core is measuring retrieval is assessing whether each of the retrieved results is relevant for a given query.
C4AI Command-R.	C4AI Command-R is a research release of a 35 billion parameter highly performant generative model. Command-R is a large language model with open weights optimized for a variety of use cases including reasoning, summarization, and question-answering. Command-R has the capability for multilingual generation evaluated in 10 languages and highly performant RAG capabilities.
Artificial Intelligence Controller Interface (AICI).	The Artificial Intelligence Controller Interface (AICI) lets you build Controllers that constrain and direct the output of a Large Language Model (LLM) in real-time. Controllers are flexible programs capable of implementing constrained decoding, dynamic editing of prompts and generated text, and coordinating execution across multiple, parallel generations.
US Public Domain Books (English).	This dataset contains more than 650,000 English books (~ 61 billion words) presumed to be in the public domain in the US which were digitized by the Internet Archive and cataloged as part of the Open Library project.
transformer-debugger.	Transformer Debugger (TDB) is a tool developed by OpenAI's Superalignment team with the goal of supporting investigations into specific behaviors of small language models. The tool combines automated interpretability techniques with sparse autoencoders.
VideoMamba.	VideoMamba is a technology that effectively manages global dependencies and local redundancy to tackle the challenges of video interpretation.
FastV.	FastV is a plug-and-play inference acceleration method for large vision language models relying on visual tokens. It could reach a 45% theoretical FLOP reduction without harming the performance through pruning redundant visual tokens in deep layers.
Maximizing training throughput using PyTorch FSDP.	Together, teams from IBM and Meta have achieved 57% MFU by rapidly training potent models in parallel on huge A100 and H100 clusters.
MoAI.	MoAI is a new large language and vision model that integrates auxiliary visual data from specific computer vision tasks to improve upon existing models.
superopenai: logging and caching superpowers for the openai sdk.	superopenai is a minimal convenience library for logging and caching LLM requests and responses for visibility and rapid iteration during development.
TripoSR.	TripoSR, a state-of-the-art open-source model for fast feedforward 3D reconstruction from a single image, collaboratively developed by Tripo AI and Stability AI.
Exploring Alternative UX Patterns for GenAI Interfaces.	In the rapidly evolving landscape of GenAI interfaces, it is crucial to venture beyond the established norms. The current dominance of Quick Actions and Multi-Turn engagement patterns in these interfaces, while effective in many cases, should not limit our imagination or hinder the potential for innovation.
rerankers.	Rerankers are an important part of any retrieval architecture, but they're also often more obscure than other parts of the pipeline. rerankers seeks to address this problem by providing a simple API for all popular rerankers, no matter the architecture.
skyvern.	Skyvern automates browser-based workflows using LLMs and computer vision. It provides a simple API endpoint to fully automate manual workflows, replacing brittle or unreliable automation solutions.
Licensing AI Means Licensing the Whole Economy.	Because artificial intelligence is a process that is essential to many different economic uses, it is not possible to regulate it like a physical thing.
Enhancing RAG-based application accuracy by constructing and leveraging knowledge graphs.	A practical guide to constructing and retrieving information from knowledge graphs in RAG applications with Neo4j and LangChain
You can now train a 70b language model at home.	We’re releasing an open-source system, based on FSDP and QLoRA, that can train a 70b model on two 24GB GPUs.
pricing sheet with all popular token-based pricing providers and the top-performing models.	Princing and comparison between different LLMs

Perspectives

Link	description
Winning Strategies for Applied AI Companies.	Key Success Factors after reviewing over 70 companies that have raised at least $7M
AI startups require new strategies: This time it’s actually different.	The typical dynamics between startups and incumbents do not apply in AI as they did in previous technology revolutions like mobile and the Internet. Ignore this at your peril.
The GPT-4 barrier has finally been broken.	Four weeks ago, GPT-4 remained the undisputed champion: consistently at the top of every key benchmark, but more importantly the clear winner in terms of “vibes”. Today that barrier has finally been smashed. We have four new models, all released to the public in the last four weeks, that are benchmarking near or even above GPT-4.
Embrace AI to break down barriers in publishing for people who aren’t fluent in English.	E. M. Wolkovich describes having a paper rejected because of an unfounded accusation that ChatGPT was used to write it. We think that both the rejection and the bias against the use of artificial intelligence (AI) in scientific writing are misguided.
Why scientists trust AI too much — and what to do about it.	Some researchers see superhuman qualities in artificial intelligence. All scientists need to be alert to the risks this creates.
The Future of Poetry.	Questions about whether poems were authored by humans or artificial intelligence (AI) were given to 38 AI experts and 39 English experts. First prize went to The Human, followed by Bard, ChatGPT-4, and Claude in that order, for both writing quality and the ability to deceive respondents into thinking that the poetry was written by a human. The fact that English specialists were far better at identifying which poems were composed by AI suggests that they should be involved more in the development of upcoming AI systems.
Barack Obama on AI, free speech, and the future of the internet.	The former president joined me on Decoder to discuss AI regulation, the First Amendment, and of course, what apps he has on his home screen.
AI startups require new strategies: This time it’s actually different.	The typical dynamics between startups and incumbents do not apply in AI as they did in previous technology revolutions like mobile and the Internet. Ignore this at your peril.
Top AIs still fail IQ tests - When asked to read image-based questions.	According to recent testing, sophisticated AI models such as ChatGPT-4 and Google's "Gemini Advanced" do poorly on visual IQ tests, receiving lower-than-average scores. Although ChatGPT-4 exhibits mediocre pattern recognition abilities, it misidentifies objects visually and makes logical mistakes, indicating a considerable difference in comparison to human intellect. These results suggest that the development of universally intelligent AI systems may still be some way off.
The Top 100 Gen AI Consumer Apps.	Over 40% of the top web products are new, having entered the top 50 in the last six months, according to Andreessen Horowitz's most recent consumer analysis on the top 100 Gen AI consumer apps.
This Nvidia Cofounder Could Have Been Worth $70 Billion. Instead, He Lives Off The Grid.	If Curtis Priem, Nvidia’s first CTO, had held onto all his stock, he’d be the 16th richest person in America. Instead, he sold out years ago and gave most of his fortune to his alma mater Rensselaer Polytechnic Institute.
How to thrive in a crowded enterprise AI market.	At a Lightspeed event, Arvind Jain, CEO of Glean, spoke on the difficulties and solutions facing corporate AI startups. He emphasized the need to provide genuine business value, being tenacious in hiring, and placing a higher priority on product quality than speed and cost. Jain also emphasized how privacy and security issues have slowed down the deployment of generative AI tools in businesses. Glean wants to become a widely used workplace AI platform that completely transforms how people work by becoming firmly integrated into organizational operations.
As AI tools get smarter, they’re growing more covertly racist, experts find.	ChatGPT and Gemini discriminate against those who speak African American Vernacular English, report shows

Back to index

ML news: Week 4 - 10 March

Research

Link	description
HyperAttention: Long-context Attention in Near-Linear Time.	It's well accepted—and informally verified—that HyperAttention is the key to Gemini's incredible 1 million+ token context window's success.
Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning.	An attempt is made to explain the success of MuP hyperparameter transfer theoretically in this study. The greatest eigenvalue of the training loss Hessian, according to its creators, is unaffected by the network's depth or breadth.
WebArena: A Realistic Web Environment for Building Autonomous Agents.	The possibility for Agents to handle a range of digital responsibilities has the community enthused. But even the most advanced general-purpose models find it difficult to accomplish jobs where people achieve more than 70% of the time. It is becoming evident that these activities could require models that have been carefully trained.
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models.	Latent space smoothness in text-to-image diffusion models is a problem that is addressed by a novel method called Smooth Diffusion. With this technique, even little changes in input will result in a steady and progressive alteration of the visuals.
Rethinking Inductive Biases for Surface Normal Estimation.	A technique called DSNIE significantly enhances monocular surface normal estimation, which finds use in various computer graphics fields.
CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition.	CricaVPR presents a revolutionary method that focuses on the relationships between many photos, even when they are taken in various situations, in order to improve visual place identification.
Empowering Large Language Model Agents through Action Learning.	investigates open-action learning for language agents using an iterative learning strategy that uses Python functions to create and improve actions; on each iteration, the proposed framework (LearnAct) modifies and updates available actions based on execution feedback, expanding the action space and improving action effectiveness; the LearnAct framework was tested on Robotic planning and AlfWorld environments, showing 32% improvement in agent performance in AlfWorld when compared to ReAct+Reflexion.
PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval.	demonstrates how to use LLMs to integrate several approaches, such as retrieval augmentation, fine-tuning, tool utilization, and more; while the suggested framework is used in the context of urban and spatial planning, many of the insights and useful advice are applicable to other fields as well.
Evo: Long-context modeling from molecular to genome scale.	Introducing Evo, a long-context biological foundation model based on the StripedHyena architecture that generalizes across the fundamental languages of biology: DNA, RNA, and proteins. Evo is capable of both prediction tasks and generative design, from molecular to whole genome scale (over 650k tokens in length). Evo is trained at a nucleotide (byte) resolution, on a large corpus of prokaryotic genomic sequences covering 2.7 million whole genomes.
Resonance RoPE: Improving Context Length Generalization of Large Language Models.	To assist LLMs in comprehending and producing text in longer sequences than they were first trained on, researchers have created a new method dubbed Resonance RoPE. By using less processing power, our approach outperforms the current Rotary Position Embedding (RoPE) technique and improves model performance on lengthy texts.
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World.	The All-Seeing Project V2 introduces the ASMv2 model, which blends text generation, object localization, and understanding the connections between objects in images.
GPQA: A Graduate-Level Google-Proof Q&A Benchmark.	A formidable task is offered by a new dataset named GPQA, which has 448 difficult multiple-choice questions covering physics, chemistry, and biology. Even domain specialists have difficulty—they only score about 65% accuracy—while non-experts only get 34%. Only 39% of advanced AI systems, such as GPT-4, are accurate. The goal of this dataset is to provide techniques for monitoring AI results in challenging scientific problems.
SURE: SUrvey REcipes for building reliable and robust deep networks.	SURE is a revolutionary strategy that integrates multiple approaches to increase the accuracy of deep neural network uncertainty predictions, particularly for image classification applications.
Stable Diffusion 3: Research Paper.	Stable Diffusion 3 outperforms state-of-the-art text-to-image generation systems such as DALL·E 3, Midjourney v6, and Ideogram v1 in typography and prompt adherence, based on human preference evaluations. Our new Multimodal Diffusion Transformer (MMDiT) architecture uses separate sets of weights for image and language representations, which improves text understanding and spelling capabilities compared to previous versions of SD3.
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents.	These days, language models are quite good at responding to queries. As a result, the majority of benchmarks in use today are saturated. 'Researchy' questions are a new breed of open-ended questions that call for several steps to complete. The source of this specific dataset is search engine queries. It includes instances where GPT-4 had trouble responding to questions.
UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control.	A novel method for improving motion quality and semantic coherence in films produced by text-to-video models is presented by UniCtrl. Employing motion injection and cross-frame self-attention approaches enhances video coherence and realism without requiring further training.
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT.	With natural language queries, VTG-GPT provides a revolutionary GPT-based technique that can precisely identify particular video segments without the need for fine-tuning or training.
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training.	With the same performance as OpenAI's original CLIP model, MobileClip operates seven times quicker. It may be utilized for a variety of language and visual activities on-device.
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures.	Vision-RWKV provides an effective solution for high-resolution image processing by modifying the RWKV architecture from NLP for use in vision challenges.
Design2Code: How Far Are We From Automating Front-End Engineering?	It's hard to take pictures of a design and turn them into code. This study suggests an 18B model as a baseline and assessments imply that we are about there for performing this on basic designs. GPT-4V-generated code is sometimes preferred to human-synthesized code.
MathScale: Scaling Instruction Tuning for Mathematical Reasoning.	Researchers created two million route issues using fake data. After training a 7B model, they discovered that it performed well when compared to the most advanced big language models.
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos.	The KEPP system offers a fresh method for organizing and carrying out difficult jobs. The approach, which makes use of a probabilistic knowledge network, enables the model to arrange activities logically to accomplish a goal.
KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents.	KnowAgent presents an innovative method for enhancing the planning abilities of big language models through the incorporation of explicit action information. The method leads LLMs through more rational planning trajectories, which improves their performance on challenging tasks.
tinyBenchmarks: evaluating LLMs with fewer examples.	In this paper, we investigate strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. This work shows that you can reliably evaluate language model performance with as few as 100 examples from popular benchmarks.
3D Diffusion Policy.	DP3 presents a novel method for imitation learning that effectively teaches robots difficult abilities by fusing diffusion strategies with 3D visual data.
Co-LLM: Learning to Decode Collaboratively with Multiple Language Models.	Using an innovative approach, multiple huge language models can collaborate by alternately producing text token by token. With the use of this tactic, models are better able to apply their distinct advantages and areas of competence to a variety of activities, including following instructions, answering questions related to a given domain, and solving reasoning-based problems.

News

Link	description
AI-generated images of Trump with Black voters being spread by supporters.	No evidence to tie fake images, including one created by Florida radio host, to Trump campaign, BBC Panorama investigation finds
Elon Musk sues OpenAI over AI threat.	OpenAI is not so open now, Musk claims, following the closed-source release of the company's artificial general intelligence technology under Microsoft.
OpenAI wants to make a walking, talking humanoid robot smarter.	Figure’s founder Brett Adcock says a new partnership with OpenAI could help its robots hold conversation and learn from its mistakes over time.
MagicLab’s humanoid can toast marshmallows, fold clothes, and dance.	Miniature high-torque servo actuators combined with sensitive multi-dimensional pressure sensors enabled the team to create an exceptionally dexterous hand–MagicBot.
Amazon to spend $1 billion on startups that combine AI with robots.	Amazon’s $1 billion industrial innovation fund is to step up investments in companies that combine artificial intelligence and robotics, as the e-commerce giant seeks to drive efficiencies across its logistics network.
Claude 3 released.	Three new Claude 3 family models have been trained by Anthropic, the best of which achieves benchmark scores that GPT4 has publicly disclosed. It excels at visual tasks and is a multimodal model as well. Claude's coding skills have significantly improved with this version, which is significant.
ChatGPT can read its answers out loud.	OpenAI’s new Read Aloud feature for ChatGPT could come in handy when users are on the go by reading its responses in one of five voice options out loud to users. It is now available on both the web version of ChatGPT and the iOS and Android ChatGPT apps.
Adobe reveals a GenAI tool for music.	Adobe unveiled Project Music GenAI Control, a platform that can generate audio from text descriptions (e.g. “happy dance,” “sad jazz”) or a reference melody and let users customize the results within the same workflow.
OpenAI fires back at Elon Musk in legal fight over breach of contract claims.	ChatGPT maker releases emails in support of claim businessman backed plan to create for-profit unit
OpenAI and Elon Musk.	In response to Elon Musk's complaint, OpenAI provided screenshots of emails between Elon Musk, Greg Brockman, Sam Altman, and Ilya Sutskever, as well as their version of events. According to the receipts, Musk thought there was little hope for OpenAI to succeed and agreed that some models should be closed-source.
Perplexity AI Reportedly Raising Additional Money At Significantly Higher Valuation Cap Than $520M.	Perplexity AI, a rising star in the field of artificial intelligence, is reportedly in discussions to secure additional funding at a valuation significantly higher than its previous round.
Le Chat.	Using its Mistral models, Mistral AI has introduced 'le Chat Mistral,' a new multilingual conversational assistant with an enterprise edition for companies.
Neuralink brain chip: advance sparks safety and secrecy concerns.	Elon Musk announced this week that his company’s brain implant has allowed a person to move a computer mouse with their mind.
Ex-Google engineer arrested for alleged theft of AI secrets for Chinese firms.	Linwei Ding, facing four counts of theft of trade secrets, accused of transferring confidential information to his personal account
Mistral x Snowflake.	Snowflake, the Data Cloud company, and Mistral AI, one of Europe’s leading providers of AI solutions, today announced a global partnership to bring Mistral AI’s most powerful language models directly to Snowflake customers in the Data Cloud.
Moondream 2 small vision language model.	Moondream is a tiny language model built on SigLIP and Phi-2. The benchmark performance has been much enhanced in this second edition, which is licensed for commercial use. It is perfect for describing visuals and operating on low-end computing hardware.
Driverless startup Waymo to test self-driving vehicles with no human driver in Austin.	Autonomous vehicle company Waymo will begin testing driverless cars, with no human behind the wheel, in Austin, starting Wednesday.
Google brings Stack Overflow’s knowledge base to Gemini for Google Cloud.	Developer Q&A site Stack Overflow is launching a new program today that will give AI companies access to its knowledge base through a new API, aptly named OverflowAPI.
Brave’s Leo AI assistant is now available to Android users.	Brave is launching its AI-powered assistant, Leo, to all Android users. The assistant allows users to ask questions, translate pages, summarize pages, create content, and more. The Android launch comes a few months after Brave first launched Leo on desktop. Brave says Leo will be available on iOS devices in the coming weeks.
Inflection-2.5.	A new model has been introduced by Inflection to power Pi, its personal assistant. The model achieves remarkable reasoning scores on benchmarks and performs within 94% of the GPT-4. In comparison to GPT-4, Inflection claims that training only required 40% of the computing. This post offers an intriguing discovery: a typical conversation with Pi lasts 33 minutes.
Cohere and Accenture Collaborate to Accelerate Enterprise AI Adoption.	Cohere and Accenture are working together to provide over 9,000 enterprise clients with cohere embedding technology.
Microsoft’s Mistral deal beefs up Azure without spurning OpenAI.	Microsoft investing in Mistral puts the focus on its Azure model offerings.

Resources

Link	description
2.4x faster Gemma + 58% less VRAM.	You can now finetune Gemma 7b 2.43x faster than HF + Flash Attention 2 with 57.5% less VRAM use. When compared to vanilla HF, Unsloth is 2.53x faster and uses 70% less VRAM.
DUSt3R.	With the help of this project, you may create 3D representations in GLB form by taking a few photos of a site and reconstructing it for usage in 3D applications.
Datasets for Large Language Models: A Comprehensive Survey.	an extensive (more than 180 pages) review and analysis of LLM datasets.
Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey.	an overview of LLMs for tabular data jobs that includes important methods, measurements, datasets, models, and optimization strategies; it also discusses unmet issues and offers suggestions for future lines of inquiry.
Using Claude 3 Opus for video summarization.	Andrej Karpathy challenged me to write a blog article based on one of his latest videos in a lengthy context. This job was completed by Claude 3 with assistance from some pre-processing data. The end product is an excellent and captivating blog post.
Dual-domain strip attention for image restoration.	A new technique that greatly enhances image restoration tasks is the dual-domain strip attention mechanism.
Open-Sora-Plan.	This project aim to reproducing Sora (Open AI T2V model), but we only have limited resource. We deeply wish the all open-source community can contribute to this project.
ML system design: 300 case studies to learn from.	We put together a database of 300 case studies from 80+ companies that share practical ML use cases and learnings from designing ML systems.
orca-math-word-problems-200k .	This dataset contains ~200K grade school math word problems. All the answers in this dataset are generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the Potential of SLMs in Grade School Math for details about the dataset construction.
mlx-swift-examples.	Apple created the MLX framework, which is used to train AI models on Macs. This repository demonstrates how to use Swift for model training on mobile devices. An MNIST classifier model can be trained with just one on an iPhone.
Text Clustering.	A free and open-source text clustering tool that makes it simple and rapid to embed, cluster, and semantically label clusters. On 100k samples, the full pipeline runs in 10 minutes.
EasyLM.	Large language models (LLMs) made easy, EasyLM is a one-stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax. EasyLM can scale up LLM training to hundreds of TPU/GPU accelerators by leveraging JAX's pjit functionality.
You can now train a 70b language model at home.	Today, we’re releasing Answer.AI’s first project: a fully open-source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs (RTX 3090 or 4090). This system, which combines FSDP and QLoRA, is the result of a collaboration between Answer.AI, Tim Dettmers (U Washington), and Hugging Face’s Titus von Koeller and Sourab Mangrulkar.
Training Models at Scale.	The goal of this tutorial is to provide a comprehensive overview of techniques and strategies used for scaling deep learning models and to provide a hands-on guide to implement these strategies from scratch in JAX with Flax using shard_map.
Genstruct 7B.	Genstruct 7B is an instruction-generation model, designed to create valid instructions given a raw text corpus. This enables the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus.
Fructose.	Fructose is a Python package to create a dependable, strongly typed interface around an LLM call.
Efficient Multi-Head Attention Implementations.	Different implementations of the widely used multi-headed attention module in contemporary LLMs varied in speed by over ten times. This notebook lists a handful and compares how well they perform.
US regulators investigate whether OpenAI investors were misled, say reports.	Internal communications from CEO Sam Altman reportedly under scrutiny in SEC inquiry
Microsoft introduces Copilot AI chatbot for finance workers in Excel and Outlook.	Microsoft is launching a Copilot for Finance, which it said will be able to perform a handful of common role-specific actions in Excel and Outlook.

Perspectives

Link	description
On the Societal Impact of Open Foundation Models.	a position paper that centers on open foundation models and discusses their advantages, disadvantages, and effects; it also suggests a framework for risk analysis and clarifies why, in certain situations, the marginal risk of these models is low. Finally, it provides a more sober evaluation of the open foundation models' effects on society.
Towards Long Context RAG.	The amazing one-million-word context window that Google's Gemini 1.5 Pro has brought to the AI community has sparked a debate regarding the future viability of retrieval-augmented generation (RAG).
Aggregator’s AI Risk.	The impact of the Internet, especially through Aggregators like Google and Meta, is comparable to that of the printing press on the spread of knowledge and the establishment of nation-states. However, the rise of generative AI puts the Aggregator model to the test by offering unique solutions that represent ingrained worldviews. This could undermine the Aggregator economics's universal appeal and point to the need for a move toward personalized AI in order to preserve its dominance.
Is Synthetic Data the Key to AGI?.	The caliber of training data has a major impact on how effective large language models are. By 2027, projections indicate that there will be a shortage of high-quality data. A possible answer to this problem is synthetic data generation, which could change internet business models and emphasize the significance of fair data access and antitrust laws.
AI Research Internship Search as a CS PhD Student.	Tips and thoughts from my relatively successful summer research internship hunt during third-year Computer Science PhD study.
How AI Could Disrupt Hollywood.	New platforms and tools may allow a person to create a feature-length film from their living room. But can they really compete with the studios?
Training great LLMs entirely from ground zero in the wilderness as a startup.	Reka's creator and well-known GPU critic Yi Tay detailed their experience building very powerful language models outside of Google in a blog post. The primary obstacles stem from hardware instability and cluster issues. They also had difficulties with software maturity.
Claude 3 Is The Most Human AI Yet.	Anthropic's Claude 3, a large language model similar to GPT-4, is notable not so much for its cost-effectiveness or benchmark test results as for its distinctly human-like, creative, and naturalistic interaction quality. This represents a major breakthrough in AI's capacity to collaborate imaginatively with writers.
Licensing AI Means Licensing the Whole Economy.	AI is a vast process employing statistical approaches, and it would be impractical to control its use across all organizations. Therefore, regulating AI like a tangible commodity is incorrect. Given AI's imminent economic ubiquity, targeted regulation for particular misuses—akin to current strategies for programming or email abuses—is more successful.
Is ChatGPT making scientists hyper-productive? The highs and lows of using AI.	Large language models are transforming scientific writing and publishing. However, the productivity boost that these tools bring could have a downside.
Artificial intelligence and illusions of understanding in scientific research.	Why are AI tools so attractive and what are the risks of implementing them across the research pipeline? Here we develop a taxonomy of scientists’ visions for AI, observing that their appeal comes from promises to improve productivity and objectivity by overcoming human shortcomings.
AI will likely increase energy use and accelerate climate misinformation – report.	Claims that artificial intelligence will help solve the climate crisis are misguided, warns a coalition of environmental groups
We Need Self-Driving Cars.	Anyone rooting against self-driving cars is cheering for tens of thousands of deaths, year after year. We shouldn’t be burning self-driving cars in the streets. We should be celebrating…
Subprime Intelligence.	Significant problems in OpenAI's Sora demonstrate the limitations of generative AI's comprehension. The technology presents both practical obstacles and revolutionary possibilities, as seen by its high computing needs and potential impact on the creative industry.
Sora, Groq, and Virtual Reality.	A few years ago, Facebook's drive into the metaverse looked misguided, and the idea of the metaverse appeared like fiction from Ernest Cline's novel. Things feel different now. Groq's deterministic circuits streamline machine-learning algorithms for quicker processing, while Sora creates intricate video situations. The combination of these developments brings us one step closer to real-time video simulation and full-fledged virtual reality.
AI Is Like Water.	For GenAI companies to have a competitive advantage, technology alone is no longer sufficient. This means that since the basic product is virtually the same, GenAI and bottled water are comparable. The primary differentiators need to originate from elements like distribution, user experience, perceived customer value, branding, and marketing.

Back to index

ML news: Week 26 February - 3 March

Research

Link	description
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs.	The RL approach REINFORCE is straightforward, well-known, and simple to comprehend. In simulators, training steadily is a challenge. In general, PPO is far more reliable and performant. REINFORCE is used by Gemini, and PPO is presumably used by GPT-4.
AlphaFold Meets Flow Matching for Generating Protein Ensembles.	The protein's post-folding state can be predicted using AlphaFold. Adding invertible flow matching allows you to significantly increase modeling capability throughout the whole protein landscape.
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models.	Researchers have created a new technique that focuses on "expert-level sparsification," which minimizes model size without sacrificing performance, to make LLMs more effective and user-friendly. For Mixture-of-Experts LLMs, which are strong but typically too large to manage simply, this is very helpful.
Towards Generalizable Hand-Object Interaction Denoising via Denoising Diffusion.	A novel method called GeneOH Diffusion enhances models' comprehension of and ability to manipulate objects with their hands. The goal of this technique is to improve the naturalness of these interactions by fixing mistakes in hand gestures and object relationships.
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis.	Except Sora, Snap Research has developed a video creation model that is 3 times faster to run than the prior state of the art.
OpenCodeInterpreter.	By training on a synthetic multi-turn dataset and utilizing human feedback, a model built on CodeLlama and DeepSeek Coder was able to achieve 85%+ on the HumanEval programming benchmark.
INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models.	A new benchmark called INSTRUCTIR aims to improve search engines' ability to infer users' intentions. INSTRUCTIR assesses how well search engines can obey user instructions and adjust to different and evolving search needs, in contrast to existing approaches that primarily concentrate on the query itself.
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases.	In terms of accuracy in jobs involving contacting API functions, Meta's 350m parameter language model has high reasoning performance, even coming close to Llama 7B. Although the model is not yet available, it is worthwhile to investigate the novelty in fixed parameter models.
ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models.	A new multilingual benchmark called ConceptMath is used to assess LLMs' arithmetic proficiency in both Chinese and English. It's special because it deconstructs arithmetic problems into discrete ideas, enabling a more thorough evaluation of an AI's mathematical prowess and shortcomings.
Generate What You Prefer: Reshaping Sequential Recommendation via Guided Diffusion.	DreamRec proposed a revolutionary 'learning-to-generate' technique to sequential recommendation, whereby it generates a 'oracle' item representing the optimal next option for the user, as opposed to the conventional way of identifying user preferences from a mixture of positive and negative things.
FlowMDM: Seamless Human Motion Composition with Blended Positional Encodings.	A novel model called FlowMDM uses text descriptions to create lengthy, continuous sequences of human movements. This groundbreaking diffusion-based model excels in accuracy and realism on important datasets by using Blended Positional Encodings to create realistic motion without the need for additional denoising stages.
VSP-LLM (Visual Speech Processing incorporated with LLMs).	We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by bringing the overwhelming power of LLMs. Specifically, VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation, where the given instructions control the type of task.
Repetition Improves Language Model Embeddings.	We present echo embeddings, an embedding strategy designed to address an architectural limitation of autoregressive models: that token embeddings cannot contain information from tokens that appear later in the input. Echo embeddings resolve this issue by repeating the input twice in the input to the embedding model. Our method has strong performance on MTEB and is compatible with many other methods for improving embedding models.
Range-Agnostic Multi-View Depth Estimation With Keyframe Selection.	Multi-View 3D reconstruction techniques process a set of source views and a reference view to yield an estimated depth map for the latter.
ChatMusician: Understanding and Generating Music Intrinsically with LLM.	Adding a modality-specific encoder to a language model is usually necessary for comprehending music. This is unstable and costly. This study demonstrated that tokenizing music into ABC notation significantly boosted music knowledge without affecting basic language proficiency.
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs.	Bytedance has produced a system called MegaScale that can be used to train massively parallel large language models. It succeeded in training a 175B LLM on 12,288 GPUs with 55.2% Model FLOP utilization (MFU), which is extremely impressive. Bytedance plans to open source some aspects of the codebase.
ListT5: Listwise Reranking with Fusion-in-Decoder Improves Zero-shot Retrieval.	ListT5 presents a novel reranking technique that not only increases information retrieval precision but also provides a workable solution to the issues that earlier listwise rerankers encountered.
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT.	Our primary contribution is the introduction of an accurate and fully transparent open-source 0.5 billion (0.5B) parameter SLM, named MobiLlama, catering to the specific needs of resource-constrained computing with an emphasis on enhanced performance with reduced resource demands.
Accurate LoRA-Finetuning Quantization of LLMs via Information Retention.	A novel method called IR-QLoRA improves quantized big language model accuracy, which makes them more appropriate for usage on low-resource devices.
Video as the New Language for Real-World Decision Making.	An incredible research that presents video as a possible improvement over current methods for AI to communicate with humans. It demonstrates the usage of video models as environment simulators, planners, agents, and computation engines.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits.	A parameter in the majority of language models is represented by 16 bits or more. This produces strong models that may be costly to operate. This study suggests a technique where each parameter is in {-1, 0, 1} and requires 1.58 bits. Performance is precisely matched by this approach up to 3B parameters. Models and codes are not yet available.
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models.	Enhancing multi-modality foundation models such as GPT-4V in low-level visual perception tasks is the main goal of this research. The extensive study collected comments on 18,973 photos from 58,000 people and produced the Q-Pathway dataset for brightness, color, and clarity analysis.
Graph Diffusion Policy Optimization.	The primary objective of this work is to improve multi-modality foundation models, like GPT-4V, in low-level visual perception tasks. The comprehensive study created the Q-Pathway dataset for brightness, color, and clarity analysis by gathering feedback on 18,973 photographs from 58,000 users.
HiGPT: Heterogeneous Graph Language Model.	A method for learning across many heterogeneous graphs without requiring fine-tuning is called HiGPT. It excels at adapting to different data distributions thanks to its integration with a unique graph tokenizer and a large corpus of graph commands.
PromptMM: Multi-Modal Knowledge Distillation for Recommendation with Prompt-Tuning.	PromptMM uses Multi-modal Knowledge Distillation to enhance recommendation systems on sites like Amazon and TikTok. In order to avoid overfitting, it eliminates errors in user preferences and streamlines systems by extracting key characteristics from different kinds of content (textual, audio, or visual).
Genie: Generative Interactive Environments.	We introduce Genie, a foundation world model trained from Internet videos that can generate an endless variety of playable (action-controllable) worlds from synthetic images, photographs, and even sketches.
UniVS: Unified and Universal Video Segmentation with Prompts as Queries.	With a unique prompt-based methodology, UniVS is a unified architecture for video segmentation that addresses the difficulties of diverse segmentation jobs. UniVS removes the requirement for heuristic inter-frame matching by utilizing prompt characteristics as queries and providing a target-wise prompt cross-attention layer. This allows UniVS to adapt to various video segmentation settings with ease.
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis.	With a deep semantic knowledge of pictures, the Coarse-to-Fine Latent Diffusion (CFLD) method avoids overfitting and offers a novel Pose-Guided Person Image Synthesis technique that overcomes the drawbacks of previous models.
Evaluating Quantized Large Language Models.	Large language models like OPT and LLaMA2 can be rendered more compute- and memory-efficient through the use of post-training quantization.
Representing 3D sparse map points and lines for camera relocalization.	With minimal memory and processing power, this study presents a novel method for 3D mapping and localization that processes both point and line information using a lightweight neural network, greatly improving pose accuracy.
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving.	Drive-WM can produce high-quality multiview films to forecast future events, allowing self-driving cars to make more intelligent and safe driving choices.
Do Large Language Models Latently Perform Multi-Hop Reasoning?.	This study delves into the fascinating world of Large Language Models (LLMs) and their ability to engage in multi-hop reasoning, akin to human thought processes. By crafting intricate prompts like "The mother of the singer of 'Superstition' is", researchers probe how LLMs navigate complex queries. They uncover compelling evidence suggesting that these models can indeed perform multi-hop reasoning, often relying on a bridge entity like Stevie Wonder to connect disparate pieces of information. The findings highlight both the strengths and limitations of LLMs in this regard, offering valuable insights for their future development and application.

News

Link	description
Microsoft reportedly makes AI server gear to cut Nvidia dependence.	Microsoft is creating its own AI server hardware to intensify actions to decrease its dependency on Nvidia, according to a source familiar with the matter speaking to The Information.
‘Embarrassing and wrong’: Google admits it lost control of image-generating AI.	Google has apologized (or come very close to apologizing) for another embarrassing AI blunder this week, an image-generating model that injected diversity into pictures with a farcical disregard for historical context. While the underlying issue is perfectly understandable, Google blames the model for “becoming” oversensitive.
Is OpenAI the next challenger trying to take on Google Search?	The Information says OpenAI is working on web search (partially powered by Bing) that would more directly compete with Google. It’s unclear if it would be standalone, or a part of ChatGPT.
Transformer Circuits Thread - Updates - February 2024.	The research experts at Anthropic have been developing a Circuit-based approach to comprehend deep neural networks. These circuits seek to pinpoint model components that are employed in particular applications. Every month, the research team publishes an update on the trials they conducted and the
A new tool targets voter fraud in Georgia – but is it skirting the law?.	A tech company supported by Trump’s former lawyer is injecting chaos into the state’s vote-counting process
Democratic political operative admits he commissioned robocall of AI Biden.	Steve Kramer said ‘easy-to-use technology’ enabled him to send automated call while New Orleans magician says he was paid $150 to make it
Mistral Large.	Mistral Large is our new cutting-edge text generation model. It reaches top-tier reasoning capabilities. It can be used for complex multilingual reasoning tasks, including text understanding, transformation, and code generation. Mistral Large achieves strong results on commonly used benchmarks, making it the world's second-ranked model generally available through an API (next to GPT-4)
Scale AI to set the Pentagon’s path for testing and evaluating large language models .	The company will create a comprehensive T&E framework for generative AI within the Defense Department.
DatologyAI is building tech to automatically curate AI training datasets.	Morcos’ company, DatologyAI, builds tooling to automatically curate datasets like those used to train OpenAI’s ChatGPT, Google’s Gemini and other like GenAI models. The platform can identify which data is most important depending on a model’s application (e.g. writing emails), Morcos claims, in addition to ways the dataset can be augmented with additional data and how it should be batched, or divided into more manageable chunks, during model training.
Bay Bridge: A supercomputer built for startups.	With flexible short-term renting options, San Francisco Compute Company is now providing the lowest-cost H100 training clusters in the world to customers who require intensive computing for AI model training but do not want to commit to long-term agreements. Its first cluster, Angel Island, is operational at the moment, and Bay Bridge will follow shortly. The unique business strategy of SF Compute places a premium on cost and accessibility for AI entrepreneurs without requiring long-term commitments.
mlabonne/AlphaMonarch-7B.	AlphaMonarch-7B is a new DPO merge that retains all the reasoning abilities of the very best merges and significantly improves its conversational abilities. Kind of the best of both worlds in a 7B model.
LazyAxolotl.	This notebook allows you to fine-tune your LLMs using Axolotl and Runpod
Apple’s electric car project is dead.	After a decade of work, the company is reportedly giving up on its ambitious effort to create an autonomous electric car.
Expressive Whole-Body Control for Humanoid Robots.	UCSD researchers trained robust, socially inclined, expressive policies for humanoid robots. Their unchoreographed dancing on grass videos is quite amazing.
Meta plans launch of new AI language model Llama 3 in July, The Information reports.	Meta Platforms (META.O), opens new tab is planning to release the newest version of its artificial-intelligence large language model Llama 3 in July which would give better responses to contentious questions posed by users, The Information reported on Wednesday.
Tim Cook Says Apple Will 'Break New Ground' in Generative AI.	Cook said that the company will "break new ground" in generative AI in 2024. "We believe it will unlock transformative opportunities for our users," said Cook.
Elon Musk sues OpenAI accusing it of putting profit before humanity.	Lawsuit says chief executive Sam Altman’s deal with Microsoft has broken organization’s mission
Figure raises $675M at $2.6B valuation.	In order to continue developing humanoid robots, Figure, a robotics startup, has secured $675 million from a number of significant investors, including OpenAI.

Resources

Link	description
Pearl - A Production-ready Reinforcement Learning AI Agent Library.	Pearl is a new production-ready Reinforcement Learning AI agent library open-sourced by the Applied Reinforcement Learning team at Meta. Pearl enables to development Reinforcement Learning AI agents.
Large Language Models for Data Annotation: A Survey.	This is a curated list of papers about LLM for Annotation
Automotive Object Detection with Spiking Neural Networks (SNNs).	One novel and effective model for autonomous cars is Spiking Neural Networks. High performance is attained using up to 85% less energy.
Berkeley function calling leaderboard.	When a language model can access resources through synthesized functions to carry out commands, this is known as function calling. To pass to such functions, the parameters must be properly synthesized. The purpose of this leaderboard is to evaluate the model's performance on function-calling tasks.
FuseChat.	FuseChat is a novel approach to combine the advantages of many huge language models into a single, more potent model without having to pay expensive training fees again.
ShieldLM .	ShieldLM is a bilingual (Chinese and English) safety detector that mainly aims to help detect safety issues in LLMs' generations. It aligns with general human safety standards, supports fine-grained customizable detection rules, and provides explanations for its decisions.
Enable decision-making based on LLM-based simulations.	An open-source project called Simulatrex is dedicated to GABM or generative agent-based modeling. Large language models are used to provide more accurate simulations.
Training-Free Long-Context Scaling of Large Language Models.	Dual chunk attention is a training-free and effective method for extending the context window of large language models (LLMs) to more than 8x times their original pre-training length. We refer to the Llama-based model with dual chunk attention as ChunkLlama.
DPO to encourage descriptiveness.	A minimal code set up with TRL to tune a model to be more descriptive.
Shape suffixes for ML coding.	The readable nature of shapes in tensors is significantly enhanced by a coding style at Character AI.
Getting started with MAX Developer Edition.	To drastically reduce complexity and accelerate AI implementations, Modular developed the MAX toolset. It is currently accessible.
Bonito.	Bonito is an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. This repo is a lightweight library for Bonito to easily create synthetic datasets built on top of the Hugging Face transformers and vllm libraries.
Awesome-LLMs-for-Video-Understanding.	A selection of helpful resources for comprehending videos with huge language models can be found in this repository.
Mist text to speech.	A new text-to-speech technology called Rime has strong conversational capabilities. This model may incorporate "ums" and realistic pauses, in contrast to earlier ones.
Add your own Ollama models.	Guidelines for contributing your own models to the Ollama repository for public usage.
2x speed up HF inference with static KV Cache.	Increased inference speed can lead to new use cases. This code proposes a method to accelerate Hugging Face inference using Llama models.

Perspectives

Link	description
Sam Altman Wants $7 Trillion.	In order to meet the fast-rising costs of developing generative AI models such as GPT, Sam Altman has proposed a $7 trillion budget, indicating an exponential increase in resources required for further iterations. This goal highlights a critical juncture in the development of AI, striking a balance between the quickening pace of scientific improvement and its wider effects on safety and societal preparedness.
Ten AI Insights from Databricks, Anyscale, and Microsoft.	This article features interviews with founders of AI-forward firms, including their perspectives on the emergence of artificial intelligence (AGI), how to approach LLMs, and basic strategies for entrepreneurs integrating AI into their products.
What the EU’s tough AI law means for research and ChatGPT.	The EU AI Act is the world’s first major legislation on artificial intelligence and strictly regulates general-purpose models.
Online images amplify gender bias.	We find that gender bias is consistently more prevalent in images than text for both female- and male-typed categories. We also show that the documented underrepresentation of women online is substantially worse in images than in text, public opinion, and US census data.
ChunkLlama.	Dual chunk attention is a training-free and effective method for extending the context window of large language models (LLMs) to more than 8x times their original pre-training length. We refer to the Llama-based model with dual chunk attention as ChunkLlama.
distilabel.	AI Feedback (AIF) framework for building datasets with and for LLMs.
StarCoder2.	StarCoder2-15B model is a 15B parameter model trained on 600+ programming languages from The Stack v2, with opt-out requests excluded.
The paradox of diffusion distillation.	Diffusion models decompose complex issues, such as image production, into numerous smaller issues, such as minimizing a small amount of noise in an image. Single-step diffusion generation has received a lot of attention, however it appears to miss the mark. This article examines the diffusion distillation conundrum and lists the various avenues of inquiry that might be pursued.

Back to index

ML news: Week 19 - 25 February

Research

Link	description
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning.	Deciding which examples to employ when aligning language models with preference data is frequently difficult. This paper proposes an unexpectedly strong baseline: pick the 1,000 longest cases.
Extreme Video Compression with Pre-trained Diffusion Models.	As diffusion models get more adept at synthesizing pictures and videos, they may be used for other purposes due to their extensive "knowledge" of the world. This study discovered an astounding 0.02 bits per pixel reduction. The secret here was to track perceptual similarities along the route and deliver a new frame of the original movie as necessary.
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset.	To train open-source Large Language Models in math that equal the performance of closed-source models, researchers have developed a new dataset called OpenMathInstruct-1. With 1.8 million problem-solution pairings, this innovation paves the way for more competitive and approachable AI systems for math teaching.
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.	A feature of the Transformer design that allows it to consume less memory at inference time is the quantization of the KV cache. The process of decreasing floating point accuracy with the least amount of quality loss is called quantization.
Pushing the Limits of Zero-shot End-to-End Speech Translation.	ZeroSwot is a novel approach to voice Translation (ST) that addresses the data scarcity and distinctions between text and voice. It may operate with a multilingual translation model by using special strategies to train a voice encoder using only speech recognition data.
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE).	A novel technique called SpLiCE simplifies the complicated visual data in CLIP.
TDViT: Temporal Dilated Video Transformer for Dense Video Tasks.	A novel Temporal Dilated Video Transformer (TDViT) has been created to enhance the analysis of tasks involving dense videos, like object detection in videos frame by frame.
Generative Representational Instruction Tuning.	A model that creates embeddings and text has been trained and released by the Contextual team. It performs noticeably better than a single specialized model. With embedding as the output modality, the model offers an intriguing interpretation of the multi-modal trend.
LoRA+: Efficient Low-Rank Adaptation of Large Models.	To improve on the current Low-Rank Adaptation (LoRA) technique for fine-tuning big models, this work introduces LoRA+. By applying multiple learning rates for important process components, LoRA+ achieves improved performance and faster fine-tuning without raising processing loads.
GaussianObject: Just Taking Four Images to Get A High-Quality 3D Object with Gaussian Splatting.	We propose GaussianObject, a framework to represent and render the 3D object with Gaussian splatting, that achieves high rendering quality with only 4 input images.
MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single to Sparse-view 3D Object Reconstruction.	This paper presents a neural architecture MVDiffusion++ for 3D object reconstruction that synthesizes dense and high-resolution views of an object given one or a few images without camera poses.
ChatterBox: Multi-round Multimodal Referring and Grounding.	A vision-language model called ChatterBox performs exceptionally well in multimodal dialogues, particularly in the recently defined job of multimodal multi-round referring and grounding.
Large language models streamline automated machine learning for clinical studies.	A knowledge gap persists between machine learning developers and clinicians. Here, the authors show that the Advanced Data Analysis extension of ChatGPT could bridge this gap and simplify complex data analyses, making them more accessible to clinicians.
Extracting accurate materials data from research papers with conversational language models and prompt engineering.	Efficient data extraction from research papers accelerates science and engineering. Here, the authors develop an automated approach that uses conversational large language models to achieve high precision and recall in extracting materials data.
GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis.	GradSafe is a novel technique that can identify dangerous prompts in big language models without requiring a lot of training. Compared to existing approaches, it can identify dangerous prompts more accurately by examining the gradients of certain parameters.
Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition.	A novel technique called Class-Aware Mask-guided (CAM) feature refinement improves text recognition in challenging environments.
Object Recognition as Next Token Prediction.	an innovative approach to object recognition that makes use of a language decoder. With this approach, text tokens are predicted from picture embeddings by using a customized non-causal attention mask. It makes it possible to sample many labels in parallel effectively.
TIER: Text and Image Encoder-based Regression for AIGC Image Quality Assessment.	To evaluate the quality of the generated images, TIER makes use of both written prompts and the images that result from them.
Large Language Models for Data Annotation: A Survey.	a taxonomy of techniques that use LLMs for data annotation; covers three aspects: LLM-based data annotation, evaluating LLM-generated annotations, and learning using LLM-generated annotations. an overview and a decent list of references that utilize LLMs for data annotation.
Generative Representational Instruction Tuning.	provides new state-of-the-art on MTEB and the unification is reported to speed up RAG by 60% for long documents. It does this by using generative representational instruction tuning, in which an LLM is trained to perform both generative and embedding tasks and designed to distinguish between them via the instructions.
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs.	demonstrates that a more straightforward version of REINFORCE performs better than both PPO and recently suggested alternatives like DPO and RAFT; all in all, it demonstrates that online RL optimization may be advantageous and inexpensive. Moreover, it demonstrates that many of the components of PPO are superfluous in an RLHF setting.
In Search of Needles in an 11M Haystack: Recurrent Memory Finds What LLMs Miss.	finds that both GPT-4 and RAG performance significantly depend on the first 25% of the input, suggesting that there is room for improved context processing mechanisms. It also reports that recurrent memory augmentation of transformer models achieves superior performance on documents of up to 10 million tokens. This study investigates the capability of transformer-based models in extremely long context processing.
When is Tree Search Useful for LLM Planning? It Depends on the Discriminator.	examines the methods used by LLM to solve multi-step issues using a framework that includes a generator, discriminator, and planner technique (such as tree search and iterative correction); reveals that, although existing LLMs do not exhibit these discrimination skills, planning approaches require discriminators with at least 90% accuracy; furthermore, it is shown that, although tree search performs well, it is at least 10–20 times slower and therefore unsuitable for real-world applications.
Chain-of-Thought Reasoning Without Prompting.	claims to significantly improve a model's reasoning capabilities over greedy decoding across reasoning benchmarks and suggests a chain-of-thought (CoT) decoding method to elicit reasoning capabilities from pre-trained LLMs without explicit prompting. It also finds that the presence of CoT in the decoding path increases the model's confidence in its final answer.
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement.	a set of free and open-source tools for writing, running, and improving code iteratively; includes 68K multi-turn interaction dataset; combines human input with code execution for dynamic code refinement; achieves excellent results on benchmarks such as HumalEval and EvalPlus.

News

Link	description
Anthropic takes steps to prevent election misinformation.	Called Prompt Shield, the technology, which relies on a combination of AI detection models and rules, shows a pop-up if a U.S.-based user of Claude, Anthropic’s chatbot, asks for voting information. The pop-up offers to redirect the user to TurboVote, a resource from the nonpartisan organization Democracy Works, where they can find up-to-date, accurate voting information.
OpenAI's next AI product could be after your job (again).	OpenAI is said to be developing AI agents that automate even more complex tasks, though their launch timeline remains unknown. One AI agent is said to take over the customer’s device to perform tasks like transferring data from a document to a spreadsheet, filling out expense reports, and entering them into accounting software. The other AI agent is said to perform more research-oriented, web-based tasks, such as creating itineraries and booking flight tickets.
Our next-generation model: Gemini 1.5.	In fact, we’re ready to introduce the next generation: Gemini 1.5. It shows dramatic improvements across several dimensions and 1.5 Pro achieves comparable quality to 1.0 Ultra while using less computing.
OpenAI on track to hit $2bn revenue milestone as growth rockets.	Thanks in large part to ChatGPT's enormous success, OpenAI has reached an annual revenue run rate of over $2 billion, making it one of the fastest-growing tech companies.
Sam Altman wants Washington's backing for his $7 trillion AI chip venture.	The OpenAI CEO is working to secure US government approval for the project as it risks raising national security and antitrust concerns, Bloomberg reported.
‘Gemini Business’ and ‘Gemini Enterprise’ plans for Google Workspace are coming.	The upcoming changelog — as spotted by Testing Catalog and Dylan Roussel on X/Twitter today — reveals the existence of “Gemini Business” and “Gemini Enterprise” plans. This will give “Google Workspace customers access to one of Google’s most capable Al models, 1.0 Ultra in Gemini, and enterprise-grade data protections.”
OpenAI Reaches $80 Billion Valuation In Venture Firm Deal, Report Says.	OpenAI inked a deal with venture capital firm Thrive Capital that boosted its valuation to $80 billion or more, the New York Times reported, a nearly threefold increase in value from just nine months ago.
Magic raises $117m to continue code generation models.	We've raised $117M to build an AI software engineer.
SoftBank Founder Masayoshi Son Aims to Raise $100 Billion for New Chip Venture, "Izanagi".	Masayoshi Son, the visionary founder of SoftBank Group Corp., has set his sights on revolutionizing the semiconductor industry with the launch of Izanagi, a groundbreaking chip venture backed by a staggering $100 billion investment.
Scribe $25M Series B.	To further its AI-driven platform, Scribe has secured a Series B fundraising round headed by Redpoint Ventures. This round aims to speed up the generation of visual step-by-step tutorials and enable knowledge exchange between enterprises.
Amazon AGI Team Say Their AI Is Showing “Emergent Abilities”.	"Big Adaptive Streamable TTS with Emergent Abilities" (BASE TTS), a language model created by Amazon AGI researchers, exhibits "state-of-the-art naturalness" in conversational text and demonstrates language skills that it wasn't particularly trained on.
Gemma: Introducing new state-of-the-art open models.	We’re releasing model weights in two sizes: Gemma 2B and Gemma 7B. Each size is released with pre-trained and instruction-tuned variants. Ready-to-use Colab and Kaggle notebooks, alongside integration with popular tools such as Hugging Face, MaxText, NVIDIA NeMo, and TensorRT-LLM, make it easy to get started with Gemma.
Reddit has a new AI training deal to sell user content.	Over a decade of valuable user content is now for sale as Reddit preps to go public.
Apple Developing AI Tool to Help Developers Write Code for Apps.	Apple is working on an updated version of Xcode that will include an AI tool for generating code, reports Bloomberg. The AI tool will be similar to GitHub Copilot from Microsoft, which can generate code based on natural language requests and convert code from one programming language to another.
Stable Diffusion 3.	Announcing Stable Diffusion 3 in early preview, our most capable text-to-image model with greatly improved performance in multi-subject prompts, image quality, and spelling abilities.
How Bret Taylor’s new company is rethinking customer experience in the age of AI.	The two founders fundamentally see AI agents as a new technology category, providing an entirely new way for customers to interact with brands to improve their overall experience.
Introducing Phind-70B – closing the code quality gap with GPT-4 Turbo while running 4x faster.	We're excited to announce Phind-70B, our largest and most performant model to date. Running at up to 80 tokens per second, Phind-70B gives high-quality answers for technical topics without making users make a cup of coffee while they wait. Phind-70B scores 82.3% on HumanEval, beating the latest GPT-4 Turbo (gpt-4-0125-preview) score of 81.1% in our evaluation.
Marqo Raises $12.5 Million to Help Businesses Build Generative AI Applications.	Marqo has raised $12.5 million in a Series A funding round to advance the adoption of its search platform that helps businesses build generative artificial intelligence (AI) applications that are more relevant and up to date.

Resources

Link	description
minbpe.	Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.
GPTScript.	GPTScript is a new scripting language to automate your interaction with a Large Language Model (LLM), namely OpenAI. The ultimate goal is to create a fully natural language-based programming experience. The syntax of GPTScript is largely natural language, making it very easy to learn and use.
QWEN.	We opensource our Qwen series, now including Qwen, the base language models, namely Qwen-1.8B, Qwen-7B, Qwen-14B, and Qwen-72B, as well as Qwen-Chat, the chat models, namely Qwen-1.8B-Chat, Qwen-7B-Chat, Qwen-14B-Chat, and Qwen-72B-Chat.
Sora Reference Papers.	A collection of all papers referenced in OpenAI's "Video generation models as world simulators"
repeng.	Control vectors are a low-cost means of controlling the output of semantic generation. Compared to LoRA, they are less expensive to train yet may still be fairly powerful. It's made simpler with this library.
OpenRLHF.	This is a Ray-based implementation of RLHF for Mistral and other Llama-style models. Several PPO stabilizing techniques are included to enhance performance.
3D Diffuser Actor: Policy Diffusion with 3D Scene Representations.	To enhance robot manipulation, the 3D Diffuser Actor blends 3D scene representations with diffusion strategies. Robots are better able to comprehend and engage with their surroundings thanks to this AI-driven method.
How to jointly tune learning rate and weight decay for AdamW.	AdamW is often considered a method that decouples weight decay and learning rate. In this blog post, we show that this is not true for the specific way AdamW is implemented in Pytorch. We also show how to adapt the tuning strategy to fix this: when doubling the learning rate, the weight decay should be halved.
OpenLLMetry-JS.	OpenLLMetry-JS is a set of extensions built on top of OpenTelemetry that gives you complete observability over your LLM application. Because it uses OpenTelemetry under the hood, it can be connected to your existing observability solutions - Datadog, Honeycomb, and others.
List of GPU clusters for rent.	a list of entire clusters that can be rented on an hourly basis.
Mamba: The Hard Way.	A detailed description of how Mamba works
new benchmark for large language models.	It's a collection of nearly 100 tests I've extracted from my actual conversation history with various LLMs.
BoCoEL.	Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) is 10 times faster with just a few lines of modular code.
FiT: Flexible Vision Transformer for Diffusion Model.	This repo contains PyTorch model definitions, pre-trained weights, and sampling code for our flexible vision transformer (FiT). FiT is a diffusion transformer-based model that can generate images at unrestricted resolutions and aspect ratios.
RobustVLM.	To defend multi-modal models like OpenFlamingo and LLaVA against visual adversarial assaults, a novel technique is presented in this study. The authors successfully defend these models against manipulative picture assaults by fine-tuning the CLIP visual encoder in an unsupervised way, increasing the models' dependability and security in practical applications without requiring complete model retraining.
HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings.	A popular benchmark called Holistic Evaluation of Language Models (HELM) was issued by the Stanford language modeling group. Additionally, they created HELM-Instruct, a version for instruction following. It is absolute, open-ended, and multifaceted.
LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4.	We’re excited to release LoRA Land, a collection of 25 fine-tuned Mistral-7b models that consistently outperform base models by 70% and GPT-4 by 4-15%, depending on the task. This collection of specialized fine-tuned models–all trained with the same base model–offers a blueprint for teams seeking to efficiently and cost-effectively deploy highly performant AI systems.
Multimodal LLM’s Ability to Understand Visual Data.	A new tool called ChartX is designed to assess how well multi-modal large language models (MLLMs) can understand and make sense of visual charts.
A Critical Evaluation of AI Feedback for Aligning Language Models.	The efficacy of integrating reinforcement learning with supervised fine-tuning in training is questioned in this repository. The more involved two-step technique can be outperformed by first training with a more sophisticated model, such as GPT-4.
MMCSG Dataset.	The MMCSG (Multi-Modal Conversations in Smart Glasses) dataset comprises two-sided conversations recorded using Aria glasses, featuring multi-modal data such as multi-channel audio, video, accelerometer, and gyroscope measurements. This dataset is suitable for research in areas like automatic speech recognition, activity detection, and speaker diarization.
MultiLora inference server.	One base model can have many LoRAs hot-swapped onto it using the Lorax inference server. This allows a large variety of model tunes to be supported with a significant reduction in RAM use.
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations.	GTBench is a language-driven environment, evaluating the strategic reasoning limitations of LLMs through game-theoretic tasks. GTBench is built on top of OpenSpiel, supporting 10 widely-recognized games
CrewAI.	A library called CrewAI is available for creating and managing AI agents that make use of Replit and LangChain. It offers an easy-to-integrate modular setup comprising tasks, agents, crews, and tools for a variety of applications. LangSmith improves performance insights into non-deterministic LLM calls while streamlining the debugging process.
gemma.cpp.	gemma.cpp is a lightweight, standalone C++ inference engine for the Gemma foundation models from Google.
MMedLM.	The official codes for "Towards Building Multilingual Language Model for Medicine".
LLM Evaluation Metrics for Labeled Data.	How to measure the performance of LLM applications with ground truth data.

Perspectives

Link	description
The data revolution in venture capital.	Investors, data scientists, and tool builders leading the data-driven future of venture capital.
The Three C's: Creativity, Collaboration, and Communication.	The way we communicate, work together, and complete creative projects has changed significantly since the invention of computing. With AI, we're beginning to witness the commencement of another significant change. We undervalue how significant this change will be. Businesses that integrate artificial intelligence (AI) into their products from the start will have a significant edge over those who add it later to already-existing goods.
Inside OpenAI Logan Kilpatrick (head of developer relations).	Have you ever wondered how OpenAI develops and innovates so quickly? The head of developer relations at OpenAI, Logan Kilpatrick, talks about the company's decision-making structure for product launches, high agency and urgency, and OpenAI's distinct culture in this podcast.
Mind-reading devices are revealing the brain’s secrets.	Implants and other technologies that decode neural activity can restore people’s abilities to move and speak — and help researchers understand how the brain works.
Generative AI’s environmental costs are soaring — and mostly secret.	First-of-its-kind US bill would address the environmental costs of the technology, but there’s a long way to go.
Strategies for an Accelerating Future.	With Google's Gemini providing a context window of over a million tokens and Groq's hardware enabling almost instantaneous responses from GPT-3.5 models, among other recent advancements in AI, these represent a significant advancement in practical AI applications and highlight the pressing need for leaders to comprehend and adjust to the rapidly changing AI landscape.
How to lose at Generative AI!	Despite its excitement, generative AI is likely to let most startups down since it benefits established players with data advantages, established workflows, and the capacity to integrate AI without requiring significant system changes. A difficult road lies ahead for startups hoping to make a significant impact in the Generative AI space, even in spite of venture capital flooding the space. These startups are essentially preparing the market for incumbents who can readily adopt and integrate AI innovations into their dominant platforms by concentrating on expeditious engineering and UX improvements at the workflow layer.
Stockholm declaration on AI ethics: why others should sign.	The use of artificial intelligence (AI) in science has the potential to do both harm and good. As a step towards preventing the harm, we have prepared the Stockholm Declaration on AI for Science.
This is why the idea that AI will just augment jobs, never replace them, is a lie!.	AI will automate labor in certain areas. The response thus far has been divided: would increased efficiency allow for more human workers to accomplish the same duties, or will fewer workers be needed? This article compares and contrasts the effects of technology on manufacturing, agriculture, and the contemporary knowledge worker.
LLM evaluation at scale with the NeurIPS Large Language Model Efficiency Challenge.	‍After a year of breakneck innovation and hype in the AI space, we have now moved sufficiently beyond the peak of the hype cycle to start asking a critical question: are LLMs good enough yet to solve all of the business and societal challenges we are setting them up for?

Back to index

ML news: Week 12 - 18 February

Research

Link	description
Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills.	It has so far proven difficult to transfer expertise amongst RL agents. An environment-neutral skill set is optimized for this work. Its generalization performance is encouraging.
Self-Play Fine-Tuning (SPIN).	We propose a new fine-tuning method called Self-Play fine-tuning (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data.
Real-World Fluid Directed Rigid Body Control via Deep Reinforcement Learning.	"Box o Flows" addresses the difficulty of replicating complicated fluid dynamics for reinforcement learning (RL) applications by introducing a unique experimental system for testing RL algorithms in dynamic real-world environments. It demonstrates how model-free reinforcement learning algorithms may produce complex behaviors from simple rewards, improve data efficiency through offline reinforcement learning, and open the door to more widespread RL use in complex systems.
WebLINX.	A collection of 100,000 web-based conversations in conversational format is called Weblinx. It was made available to advance research on web-based navigation guided by language models.
ImplicitDeepfake: Plausible Face-Swapping through Implicit Deepfake Generation using NeRF and Gaussian Splatting.	To produce incredibly lifelike 3D avatars, this work presents ImplicitDeepfake1, a novel method that blends deepfake technology with Gaussian Splatting (GS) and Neural Radiance Fields (NeRFs).
AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts.	Researchers have created a novel method to improve language models' mathematical proficiency by letting base models choose excellent mathematical information on their own.
Complete Instances Mining for Weakly Supervised Instance Segmentation.	A novel method for image segmentation has been presented by researchers that uses just simple image labels to identify particular portions of a picture, such as a dog. They overcame the difficulty of a network identifying many occurrences of the same object by presenting an innovative technique that improves efficiency and lowers mistake rates.
Whispers in the Machine: Confidentiality in LLM-integrated Systems.	The increased pairing of huge language models with external technologies has given rise to new vulnerabilities associated with data breaches. This research offers a methodical way to assess various AI systems' privacy protection efficacy.
This AI learned the language by seeing the world through a baby’s eyes.	An artificial intelligence (AI) model has learned to recognize words such as ‘crib’ and ‘ball’, by studying headcam recordings of a tiny fraction of a single baby’s life. original article.
World Model on Million-Length Video and Language with RingAttention.	This model can correctly respond to queries with a million token video duration using ring attention and an optimized 7B parameter model. It performs exceptionally accurately on retrieval benchmarks and beats commercial VLMs.
LUMIERE - A Space-Time Diffusion Model for Video Generation.	A new text-to-video model from Google can assist in accepting input in the form of images and styles. It diffuses everything simultaneously via a brand-new "space-time UNet."
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction.	With the help of textual descriptions, SEINE is a novel video diffusion model that can expand short AI-generated video clips into larger, narrative-level segments with smooth and creative scene transitions.
Text-Driven Image Editing via Learnable Regions.	Given an input image and a language description for editing, our method can generate realistic and relevant images without the need for user-specified regions for editing. It performs local image editing while preserving the image context. Our method can also handle multiple-object and long-paragraph scenarios.
Video annotator.	The annotation process directly incorporates subject experts thanks to the Video Annotator framework. This novel method increases the accuracy and efficiency of the model by combining human expertise with zero-shot and active learning techniques.
Automated Unit Test Improvement using Large Language Models at Meta.	Meta created tests for its code base using massive language models. It discovered significant gains in overall code quality and test coverage.
Meta’s V-JEPA model.	According to Yann LeCun, VP and Chief AI Scientist at Meta, more data-efficient self-supervised models are required for general intelligence. This approach, which uses models trained on video to comprehend parts of the world, is a first step in that direction. The models can be accessed by the general public.
Extreme Video Compression with Pre-trained Diffusion Models.	Diffusion models have been used by researchers to create a novel video compression technique that produces high-quality video frames at low data rates.

News

Link	description
Laion releases assistant BUD-E.	An open assistant that runs on a gaming laptop and utilizes highly optimized language models and natural voice has been made available by the Laion research group. The project's goal is to offer a capable, low-resource personal assistant that is simple to deploy.
OpenAI Hits $2 Billion Revenue Milestone.	Microsoft-backed OpenAI hit the $2 billion revenue milestone in December. The company's annualized revenue topped $1.6 billion in December based on strong growth from its ChatGPT product, up from $1.3 billion as of mid-October, the Information had reported previously.
AI PCs will make up nearly 60% of total PC shipments by 2027.	Demand for AI PCs to start ramping up this year
The first human received an implant from Neuralink yesterday and is recovering well.	Initial results show promising neuron spike detection.
Reka Flash: An Efficient and Capable Multimodal Language Model.	Reka Flash is a state-of-the-art 21B model trained entirely from scratch and pushed to its absolute limits. It serves as the “turbo-class” offering in our lineup of models. Reka Flash rivals the performance of many significantly larger models, making it an excellent choice for fast workloads that require high quality. On a myriad of language and vision benchmarks, it is competitive with Gemini Pro and GPT-3.5.
Apple releases ‘MGIE’, a revolutionary AI model for instruction-based image editing.	Apple has released a new open-source AI model, called “MGIE,” that can edit images based on natural language instructions. MGIE, which stands for MLLM-Guided Image Editing, leverages multimodal large language models (MLLMs) to interpret user commands and perform pixel-level manipulations. The model can handle various editing aspects, such as Photoshop-style modification, global photo optimization, and local editing.
DeepMind framework offers a breakthrough in LLMs’ reasoning.	A breakthrough approach in enhancing the reasoning abilities of large language models (LLMs) has been unveiled by researchers from Google DeepMind and the University of Southern California. Their new ‘SELF-DISCOVER’ prompting framework – published this week on arXiV and Hugging Face – represents a significant leap beyond existing techniques, potentially revolutionizing the performance of leading models such as OpenAI’s GPT-4 and Google’s PaLM 2.
Meta will start detecting and labeling AI-generated images from other companies.	The feature will arrive on Facebook, Instagram, and Threads in the coming months
Stability and Wurstchen release new text-to-image model.	a new text-to-image model building upon the Würstchen architecture. Stable Cascade is exceptionally easy to train and finetune on consumer hardware thanks to its three-stage approach. In addition to providing checkpoints and inference scripts, we are releasing scripts for finetuning, ControlNet, and LoRA training to enable users further to experiment with this new architecture that can be found on the Stability GitHub page.
Memory and new controls for ChatGPT.	OpenAI is testing a new feature that allows ChatGPT to remember facts across conversations. This can be switched off if desired. It will allow for a higher measure of personalization when interacting with the chat system.
Report: Sam Altman seeking trillions for AI chip fabrication from UAE, others.	On Thursday, The Wall Street Journal reported that OpenAI CEO Sam Altman is in talks with investors to raise as much as $5 trillion to $7 trillion for AI chip manufacturing, according to people familiar with the matter. The funding seeks to address the scarcity of graphics processing units (GPUs) crucial for training and running large language models like those that power ChatGPT, Microsoft Copilot, and Google Gemini.
Meta to deploy in-house custom chips this year to power AI drive.	Facebook owner Meta Platforms plans to deploy into its data centers this year a new version of a custom chip aimed at supporting its artificial intelligence (AI) push, according to an internal company document seen by Reuters on Thursday.
Google Launches €25 Million AI Opportunity Initiative for Skills Training Across Europe.	By investing in AI literacy, infrastructure, and partnerships across sectors, the company hopes to empower broad segments of the workforce with valuable future-proof skills.
The brain area that lights up in prickly people.	Those who are quick to take offense show similar levels of activity in a region of the brain that’s crucial for decision-making.
Disrupting malicious uses of AI by state-affiliated threat actors.	OpenAI discovered and terminated accounts affiliated with nation-states using GPT models for malicious cases.
Andrej Karpathy is leaving OpenAI again — but he says there was no drama.	Andrej Karpathy, a widely respected research scientist, announced today that he has left OpenAI. This is the second time Karpathy has left the top AI firm and his departure is not because of any event, issue, or drama, he said.
NVIDIA’s new AI chatbot runs locally on your PC.	NVIDIA just released a free demo version of a chatbot that runs locally on your PC. This is pretty neat, as it gives the chatbot access to your files and documents. You can feed Chat with RTX a selection of personal data and have it create summaries based on that information. You can also ask it questions, just like any chatbot, and dive into your data for answers.
MAGNeT: Masked Audio Generation using a Single Non-Autoregressive Transformer.	Facebook unveiled an advanced open-source audio model that is 7 times quicker than competing models without compromising on quality. It can produce sound effects and music. The manuscript is now accessible.
MIMIR.	Python package for measuring memorization in LLMs.
Nvidia is now worth as much as the whole Chinese stock market.	Nvidia is now worth the same as the whole Chinese stock market as defined by Hong Kong-listed H-shares, Bank of America chief investment strategist Michael Hartnett pointed out in a new note. The company's market cap has hit $1.7 trillion, the same as all Chinese companies listed on the Hong Kong Stock Exchange. Nvidia's stock soared 239% in 2023 and is up 41% in 2024, through Thursday.
OpenAI Sora.	A new video-generating model with amazing quality was revealed by OpenAI. Red teamers are allowed to test it right now.
Lambda Raises $320M To Build A GPU Cloud For AI.	Lambda’s mission is to build the #1 AI compute platform in the world. To accomplish this, we’ll need lots of NVIDIA GPUs, ultra-fast networking, lots of data center space, and lots of great new software to delight you and your AI engineering team.
USPTO says AI models can’t hold patents.	The United States Patent and Trademark Office (USPTO) published guidance on inventorship for AI-assisted inventions, clarifying that while AI systems can play a role in the creative process, only natural persons (human beings) who make significant contributions to the conception of an invention can be named as inventors. It also rules out using AI models to churn out patent ideas without significant human input.

Resources

Link	description
RLX: Reinforcement Learning with MLX.	RLX is a collection of Reinforcement Learning algorithms implemented based on the implementations from CleanRL in MLX, Apple's new Machine Learning framework.
llmware.	llmware is a unified framework for developing LLM-based application patterns including Retrieval Augmented Generation (RAG). This project provides an integrated set of tools that anyone can use - from a beginner to the most sophisticated AI developer - to rapidly build industrial-grade, knowledge-based enterprise LLM applications with a specific focus on making it easy to integrate open-source small specialized models and connecting enterprise knowledge safely and securely.
Point Transformer V3.	For processing 3D point clouds, the Point Transformer V3 (PTv3) model is an effective and straightforward paradigm. By putting more of an emphasis on efficiency and scaling up than on fine-grained design details, it can attain quicker processing speeds and improved memory economy.
phidata.	Phidata is a toolkit for building AI Assistants using function calls. Function calling enables LLMs to achieve tasks by calling functions and intelligently choosing their next step based on the response, just like how humans solve problems.
ml-mgie.	Apple released code that uses multimodal language models to improve human-provided natural language edits to images.
Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting.	Lag-Llama is the first open-source foundation model for time series forecasting!
Learning to Fly in Seconds.	This repository contains the code for the paper Learning to Fly in Seconds. It allows to train end-to-end control policies using deep reinforcement learning. The training is done in simulation and is finished within seconds on a consumer-grade laptop. The trained policies generalize and can be deployed on real quadrotors
Packing Inputs Without Cross-Contamination Attention.	By concatenating instances, packing in training models can enhance training effectiveness. When examples are handled carelessly, contamination might happen since the focus isn't sure where to end. Although the community has discovered that EOS is frequent enough, issues can nevertheless arise. This repository offers a Hugging Face implementation for popular models to correctly compress input data.
ZLUDA.	ZLUDA lets you run unmodified CUDA applications with near-native performance on AMD GPUs.
GenTranslate.	A novel method called GenTranslate leverages massive language models to enhance translation quality. The best translations produced by foundational models are the main focus. Tests have shown that the approach performs better than the state-of-the-art translation models.
Design2Code.	Design2Code is an open-source project that converts various web design formats, including sketches, wireframes, Figma, XD, etc., into clean and responsive HTML/CSS/JS code. Just upload your design image, and Design2Code will automatically generate the code for you. It's that simple!
SGLang.	SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.
DALI.	This study presents cutting-edge techniques to guarantee that autonomous intelligent agents—which are essential in applications that depend on life—remain morally and ethically sound even as they develop.
Reor Project.	Reor is an AI-powered desktop note-taking app: it automatically links related ideas, answers questions on your notes and provides semantic search. Everything is stored locally and you can edit your notes with an Obsidian-like markdown editor.
Dinosaur: differentiable dynamics for global atmospheric modeling.	The Google group has made code available to support atmospheric modeling. DeepMind's latest weather modeling tools are built around this code.
Neural Flow.	This is a Python script for plotting the intermediate layer outputs of Mistral 7B. When you run the script, it produces a 512x256 image representing the output at every layer of the model. The concept is straightforward: collect the output tensors from each layer, normalize them between zero and one, and plot these values as a heatmap. The resulting image reveals a surprising amount of structure. I have found this enormously helpful for visually inspecting outputs when fine-tuning models.
Tabula Rasa: not enough data? Generate them!.	How you can apply generative AI to tabular data
A practical guide to neighborhood image processing.	Love thy neighbors: How the neighbors are influencing a pixel

Perspectives

Link	description
AI agents as a new distribution channel.	By making judgments about what to buy on behalf of customers, AI agents are starting to emerge as a new route of distribution that might level the playing field between startups and established players. Businesses will need to adjust their goods to cater to AI tastes instead of human ones as this trend develops, which will alter the conventional dynamics of product appraisal, purchase, and discovery. The development of AI portends a time when agent-driven commerce may completely change the way goods are advertised and bought.
Thinking about High-Quality Human Data.	The topic of this piece is how people generate data. It also covers labeling, annotating, and gathering preference data, among other topics.
AI Aesthetics.	Artificial Intelligence will radically transform the way we create, appreciate, and produce art. This article delves deeper into this topic and identifies the businesses spearheading the shift.
NYC: Brain2Music.	Research talk from Google about reading music from a person’s brain.
Massed Muddler Intelligence.	A move away from conventional monolithic AI scaling and toward a paradigm based on distributed, agent-based systems that learn and adapt in real-time is represented by the idea of massed muddler intelligence, or MMI. MMI promotes AI development that stresses scalable, interactive agents with a degree of autonomy and mutual governance, moving away from the current focus on accumulating larger datasets and computational resources. This approach is based on the principles of embodiment, boundary intelligence, temporality, and personhood.
AI Could Actually Help Rebuild The Middle Class.	AI doesn’t have to be a job destroyer. It offers us the opportunity to extend expertise to a larger set of workers.
Letter from the YouTube CEO: 4 Big bets for 2024.	YouTube is investing in diverse revenue streams for creators. The platform witnessed a 50% increase in the use of channel memberships. It is creating creator support networks through programs like the Creator Collective. Efforts are undertaken to help politicians appreciate and respect the economic and entertainment worth of artists.
Meta’s AI Chief Yann LeCun on AGI, Open-Source, and AI Risk.	Ahead of the award ceremony in Dubai, LeCun sat down with TIME to discuss the barriers to achieving “artificial general intelligence” (AGI), the merits of Meta’s open-source approach, and what he sees as the “preposterous” claim that AI could pose an existential risk to the human race.
Deepfakes, trolls and cybertroopers: how social media could sway elections in 2024.	Faced with data restrictions and harassment, researchers are mapping out fresh approaches to studying social media’s political reach.
Why "Chat over Your Data" Is Harder Than You Think.	Contrary to popular belief, developing chat-based, domain-specific LLM applications and copilots is challenging. Achieving strong performance, managing intricate queries and data, and providing robust data retrieval for LLM-based chat apps are a few of the difficulties.

Back to index

ML news: Week 5 - 11 February

Research

Link	description
Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection.	When it comes to detecting bogus news, a refined BERT model performs better than an off-the-shelf LLM like GPT-3.5-turbo.
PAP-REC: Personalized Automatic Prompt for Recommendation Language Model.	To improve the efficacy and efficiency of Recommendation Language Models, PAP-REC has developed a technique that automatically generates tailored prompts.
PAM: Prompting Audio-Language Models for Audio Quality Assessment.	PAM is a tool that evaluates audio quality without reference tracks or specific training by using Audio-Language Models.
AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning.	AnimateLCM is a novel method that divides the learning process into two halves to rapidly produce high-quality films and enhance current video diffusion models.
Boximator: Generating Rich and Controllable Motions for Video Synthesis.	Controlling video synthesis is a well-known challenge. This paper suggests guiding the generation using boxes and arrows over time, which enhances human preference judgment but still leaves the user with imperfect guidance.
KTO: Model Alignment as Prospect Theoretic Optimization.	Kahneman-Tversky Optimization (KTO) is a novel method for conditioning AI models to more closely resemble human thought processes. Utilizing ideas from prospect theory developed by Kahneman & Tversky, KTO prioritizes utility above preference likelihood.
A simple method to reduce hallucination in Large Vision-Language Models.	This study clarifies the reasons for multimodal hallucination, a condition in which large vision-language models (LVLMs) occasionally represent visuals erroneously. One important factor is semantic shift bias, especially at paragraph breaks.
CapHuman: Capture Your Moments in Parallel Universes.	Given only one reference facial photograph, our CapHuman can generate photo-realistic specific individual portraits with content-rich representations and diverse head positions, poses, facial expressions, and illuminations in different contexts.
Nomic Embed: Training a Reproducible Long Context Text Embedder.	Nomic-Embed-Text-V1 is an open-source, completely reproducible text embedding model that raises the bar. It does well on activities with both short and lengthy contexts. Nomic-Embed-Text-V1, which is transparent to the extreme, provides full access to its model weights, training code, and a large dataset consisting of 235 million text pairs.
SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?	Training large-scale picture models is difficult due to legitimate copyright concerns and the disappearance of large-scale datasets like LAION. This work demonstrates that 30 million artificially created pictures may be used to train a strong CLIP model.
Rethinking Optimization and Architecture for Tiny Language Models.	This work investigates how to focus on small models with fewer parameters to develop strong language models better suited for mobile devices.
Unified Hallucination Detection for Multimodal Large Language Models.	To address the important problem of hallucinations in Multimodal Large Language Models (MLLMs), researchers have created a new benchmark called MHaluBench, which is used to assess different hallucination detection techniques.
InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions.	With InteractiveVideo, users may now create videos in a new style that allows for dynamic user interaction. This intuitive framework, in contrast to conventional techniques, enables real-time adjustments utilizing text, graphics, painting, and even drag-and-drop.
DeepSeekMath.	DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4.
Natural language guidance of high-fidelity text-to-speech models with synthetic annotations.	These Stability AI-trained text-to-speech algorithms can follow exact natural language commands. Its developers artificially annotated a sizable corpus of speech for training as there isn't a sizable dataset with appropriate textual descriptions of audio for creation. This is a further illustration of the larger trend of generative modeling training, up-captioning, and annotating.
MusicRL: Aligning Music Generation to Human Preferences.	The Google MusicLM team used an RL approach on their music-generating models using 300k feedback pieces and other incentive signals. They discovered that in human preference experiments, it performs better than the base model; nonetheless, it is not evident whether RL technique produces the greatest quality output.
A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation.	To increase CLIP's performance in picture classification tasks without needing more training or resources, this article revisits the traditional Gaussian Discriminant Analysis (GDA) approach.
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model.	The line of sophisticated vision-language models for mobile devices known as MobileVLM V2 offers appreciable performance gains thanks to creative architecture.
The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs.	According to a recent study, multi-modal large language models (MLLMs) like GPT-4V have a flaw in that they make mistakes when dealing with particular kinds of image-text inputs. A benchmark called CorrelationQA was created to assess how well MLLMs performed in situations where text could be contradicted or misled by visuals.
Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction.	The creation of a generalist AI agent that can comprehend and adhere to gaming instructions is examined in this research as a first step toward "read-to-play" capabilities. The researchers incorporate multimodal game instructions into a decision transformer to improve the agent's multitasking and generalization abilities.
MetaTree: Learning a Decision Tree Algorithm with Transformers.	MetaTree is a transformer-based decision tree algorithm. It learns from classical decision tree algorithms for better generalization capabilities.

News

Link	description
Sakana Awarded Japanese Government Supercomputing Grant.	Sakana AI is one of seven institutions in Japan chosen by the Japanese government to receive a supercomputing grant, for encouraging the development of foundation AI models to strengthen the capabilities of Japan’s generative AI ecosystem.
Hugging Face launches open source AI assistant maker to rival OpenAI’s custom GPTs.	Hugging Face, the New York City-based startup that offers a popular, developer-focused repository for open source AI code and frameworks (and hosted last year’s “Woodstock of AI”), today announced the launch of third-party, customizable Hugging Chat Assistants.
Arc is building an AI agent that browses on your behalf.	The Browser Company, which makes the Arc Browser, is on a quest to change that by building an AI that surfs the web for you and gets you the results while bypassing search engines.
Introducing Qwen1.5.	0.5B to 72B range of parameters. This collection of multilingual models is outstanding. It's interesting to note that the first significant sub-1B parameter language model is the smallest model.
Inside OpenAI’s Plan to Make AI More ‘Democratic’.	Colin Megill met with Wojciech Zaremba, co-founder of OpenAI, in May 2023 to talk about integrating Polis, an AI-powered public debating platform that promotes democratic involvement. The cooperation sought to use public feedback to match AI with human ideals. It started the "Democratic Inputs to AI" project at OpenAI, which aims to investigate AI governance through a $1 million award program.
Roblox releases real-time AI chat translator.	Roblox built an AI model that it says translates text chats so quickly users may not even notice it’s translating the messages of other players at first. It works with 16 languages, including English, French, Japanese, Thai, Polish, and Vietnamese.
OpenAI is adding new watermarks to DALL-E 3.	OpenAI says watermarks in image metadata are not perfect, but they help build trust of digital information.
Microsoft Copilot for Sales and Copilot for Service are now generally available.	The AI-powered Copilot for Sales and Service from Microsoft is now widely accessible. It increases the efficiency of sales and support staff by integrating with CRM platforms like Salesforce. The solutions promise to improve customer interactions and expedite company operations by automating repetitive tasks and providing insights directly within Microsoft 365 apps. Early users of these AI capabilities, such as Avanade, report considerable time savings and improved client engagement.
First passages of rolled-up Herculaneum scroll revealed.	Researchers used artificial intelligence to decipher the text of 2,000-year-old charred papyrus scripts, unveiling musings on music and capers.
IBM wants to build a 100,000-qubit quantum computer.	The company wants to make large-scale quantum computers a reality within just 10 years.
Microsoft brings new AI image functionality to Copilot, adds new model Deucalion.	In a startling move, Microsoft today announced a redesigned look for its Copilot AI search and chatbot experience on the web (formerly known as Bing Chat), new built-in AI image creation and editing functionality, and a new AI model, Deucalion, that is powering one version of Copilot.
Meet ‘Smaug-72B’: The new king of open-source AI.	A new open-source language model has claimed the throne of the best in the world, according to the latest rankings from Hugging Face, one of the leading platforms for natural language processing (NLP) research and applications.
EU’s AI Act passes last big hurdle on the way to adoption.	The European Union’s AI Act, a risk-based plan for regulating applications of artificial intelligence, has passed what looks to be the final big hurdle standing in the way of adoption after Member State representatives today voted to confirm the final text of the draft law.
OpenAI forms a new team to study child safety.	Under scrutiny from activists — and parents — OpenAI has formed a new team to study ways to prevent its AI tools from being misused or abused by kids.
Human brain cells hooked up to a chip can do speech recognition.	Clusters of brain cells grown in the lab have shown potential as a new type of hybrid bio-computer.
Bard becomes Gemini: Try Ultra 1.0 and a new mobile app today.	You may now finally interact with Gemini Ultra 1.0 thanks to a new service that Google established. However, access to the model will need a monthly subscription fee. Additionally, a companion smartphone app exists.
1X robotics demonstration.	One robotics startup, 1X, has achieved significant advancements in video-to-control models. The robot, which is powered by neural networks that generate 10 Hz control impulses from visual input, has been demonstrated by the business executing a variety of tasks.
AR glasses with multimodal AI nets funding from Pokémon GO creator.	Today, Singapore-based Brilliant Labs announced its new product, Frame, a pair of lightweight AR glasses powered by a multimodal AI assistant called Noa. The glasses have captured the attention and investment of John Hanke, CEO of Niantic, the augmented reality platform behind games like Pokémon GO.

Resources

Link	description
aphrodite-engine.	For AI inference workloads, the Aphrodite engine can increase throughput while lowering VRAM needs.
chatllm-vscode.	ChatLLM is a VSCode extension for interacting with LLM APIs in a flexible and long-form manner. It leverages the VSCode notebook support to do so, creating a new type of notebook (.chatllm) files where you can interact with an (API-based) LLM system over a long document.
diffusers v0.26.0.	This new release comes with two new video pipelines, a more unified and consistent experience for single-file checkpoint loading, support for multiple IP-Adapters’ inference with multiple reference images, and more.
Ollama vision models.	Recently, support for vision models was introduced by Ollama. Llava 1.6 comes with both Python and JavaScript packages that offer enhanced support and vision functionality.
Image to Music v2.	Images to text, text to prompt, and prompt to music can all be translated into a visually appealing pipeline.
3DTopia.	A two-stage text-to-3D generation model. The first stage uses a diffusion model to quickly generate candidates. The second stage refines the assets chosen from the first stage.
Open Source Alternative to Rabbit.	An open-source version of the Rabbit hardware, complete with language modeling, is being developed by a team.
NaturalSQL by ChatDB.	NaturalSQL by ChatDB is a series of models with state-of-the-art performance on Text to SQL instructions.
contextual_bandits_tutorial.	Meta maintains the RL framework called Pearls. This tutorial uses the program to walk through a bandit-based learning problem.
BRIA Background Removal v1.4 Model Card.	RMBG v1.4 is our state-of-the-art background removal model, designed to effectively separate foreground from background in a range of categories and image types. This model has been trained on a carefully selected dataset, which includes: general stock images, e-commerce, gaming, and advertising content, making it suitable for commercial use cases powering enterprise content creation at scale.
MetaVoice-1B.	a small and powerful text-to-speech model that supports generation and voice cloning.
Latxa.	Latxa is a collection of foundation models specifically tuned for Basque.
fabric.	An open-source framework for augmenting humans using AI.
YOLO-World.	The process of locating objects and their bounding boxes is called object detection. Usually, only a predetermined selection of items selected during training may be used for this. This study presents a real-time approach capable of Open Vocabulary object identification, i.e., detecting bounding boxes for any combination of objects supplied at run-time.
SELF-DISCOVER.	the implementation of SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures. a novel prompting technique that allows language models to use a set of reasoning primitives to discover a larger framework for problem-specific reasoning.
AI Filter.	AI Filter is a Chrome extension that uses a local language model to filter your social media feeds (currently, only Twitter / X) according to your instructions.
Fully Local RAG using Ollama & PgVector.	Using Ollama, pgvector, and local data, you can create a complex and potent RAG system that operates on your hardware.
LightEval .	LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
CogCoM.	CogCoM is a general vision-language model (VLM) endowed with a Chain of Manipulations (CoM) mechanism, that enables VLMs to perform multi-turns evidential visual reasoning by actively manipulating the input image. We now release CogCoM-base-17b, a model with 10 billion visual parameters and 7 billion language parameters, trained on a data fusion of 4 types of capabilities (instruction-following, OCR, detailed-captioning, and CoM).
How we got fine-tuning Mistral-7B to not suck: Helix Project Report, Feb 2024.	By using a set of a pair of questions that gathered material from a variety of viewpoints and produced a content-addressed hash for every document, HelixML was able to improve Mistral-7B.
VatsaDev/animebench-alpha.	a benchmark dataset with quotes and information about different anime characters to evaluate language model performance.
NextBrain: a next-generation, histological atlas of the human brain for high-resolution neuroimaging studies.	We present a next-generation probabilistic atlas of the human brain using histological sections of five full human hemispheres with manual annotations for 333 regions of interest. This website enables the interactive inspection of these five cases using a 3D navigation interface and search functionality.
Efficient Linear Model Merging for LLMs.	Model merging is a technique for combining multiple pre-trained or finetuned LLMs into a single, more powerful model. This approach is particularly useful when individual models excel in different domains or tasks, and merging them can create a model with a broader range of capabilities and improved overall performance.

Perspectives

Link	description
MIT Paper: AI’s Labor Market Impacts Are Slower Than Expected.	The economic feasibility of automating vision-based operations is examined in the working paper "Beyond AI Exposure: Which Tasks are Cost-Effective to Automate with Computer Vision?" authored by researchers from IBM and MIT. Just 23% of them are profitable to automate, it was discovered. In contrast with more disruptive expectations, the report projects a gradual impact on the job market over several years.
How AI Is Helping Us Learn About Birds.	Machine learning is powering new insights into how birds migrate—and forecasts about where they’ll go next
The Techno-Industrial Revolution.	The increasing sophistication of AI tooling and corporate use cases will lead to an increasing number of practical uses of the technology. The potential here can be viewed through the lens of how AI will increase margins significantly while lowering costs and improving process efficiency. This could open the door to entirely new approaches that weren't previously viable due to extremely narrow profit margins. A couple of these examples are examined in this article.
The path to profitability for AI in 2024.	The emphasis of AI research has recently shifted from accuracy and breadth to efficiency and depth. AI's increasing energy consumption and NVIDIA's H100 sales demonstrate the industry's size. Research is now focused on smaller, more efficient models, such as Phi 2, and emphasizes sustainable economics from model architecture to deployment, all because investments expect profitability. AI's computational efficiency and energy efficiency are expected to increase with advancements in training, fine-tuning, and design. On-device features are a reflection of a larger movement towards more useful and sustainable AI applications.
How design drove $10M in preorders for Rabbit R1 AI hardware.	In an expansive interview, Rabbit CEO Jesse Lyu shares how he collaborates with Teenage Engineering, why he didn’t want to make a phone, and how the R1’s retro-future design is key to the company’s strategy.
What’s next for robotaxis in 2024.	In addition to restoring public trust, robotaxi companies need to prove that their business models can compete with Uber and taxis.
Google's Gemini Advanced: Tasting Notes and Implications.	Similar to OpenAI's GPT-4, Google's recently released GPT-4 class AI model, Gemini Advanced, exhibits comparable characteristics. It excels at providing explanations and fusing search with images.
Thesis on value accumulation in AI.	This investor's perspective breaks down the layers of value that exist in AI today into three categories: AI-enhanced products (like all of you that use AI to improve your products), modeling and core (like OpenAI and Anthropic), and infrastructure layer (like cloud providers and chip makers).

Back to index

ML news: Week 29 January - 4 February

Research

Link	description
Matryoshka Representation Learning.	The new embeddings from OpenAI are scalable to meet your demands. This is thought to be caused by the learning strategy known as the nesting doll approach, which learns characteristics at different granularities.
Vivim: a Video Vision Mamba for Medical Video Object Segmentation.	A new framework called Vivim efficiently processes lengthy video sequences for medical video object segmentation. In comparison to conventional techniques, Vivim provides faster and more accurate segmentation results by effectively compressing spatiotemporal data using the state space model methodology.
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities.	This study presents a unique way to improve transformers by utilizing disparate input from many modalities, e.g., audio data to improve an image model. By connecting the transformers of two distinct modalities in a unique way, the Multimodal Pathway enables a target modality to profit from the advantages of another.
pix2gestalt: Amodal Segmentation by Synthesizing Wholes.	A framework called Pix2Gestalt is intended for zero-shot amodal segmentation. When an item is partially occluded, it can rebuild its entire shape and look with great skill. Pix2Gestalt, which makes use of large-scale diffusion models, performs exceptionally well in difficult situations, such as producing artistic images that break convention.
Large-Vocabulary 3D Diffusion Model with Transformer.	The variety of objects that may be generated in 3D poses a significant difficulty. This study builds up the system to operate with a considerably bigger range of items in each 3D category and employs a changed architecture to enhance sampling efficiency.
SliceGPT: Compress Large Language Models by Deleting Rows and Columns.	Another potential distillation work. Importantly, this one can work on models as small as Phi-2. This means you can remove 90% of the rows and columns of weight matrices with minimal reduction to quality at almost all scales.
Learning Universal Predictors.	The process of teaching systems to learn from experience and swiftly adjust to new tasks is known as meta-learning. With artificial data produced by a Universal Turing Machine, this Google project enhances Meta-Learning and conducts both theoretical and experimental analysis of the outcomes.
CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion.	CreativeSynth is an artistic picture editing technique that combines text and image inputs in a seamless manner. Its diffusion approach, which has specialized attention processes built in, allows for fine alteration of both style and content while maintaining the essential elements of the original artwork.
Annotated Hands for Generative Models.	By adding three more channels to training photos for hand annotations, researchers have increased the capacity of generative models, such as GANs and diffusion models, to produce realistic hand images.
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling.	Many AI systems employ the concept of "up captioning" to enhance labels during training. This work from Apple rephrases C4 as instructions, Q&A pairs, and more in order to apply it to pre-training. The rephrasing step increased convergence by 10x, according to the study, making the model significantly more sample-efficient, albeit at the expense of the rephrasing step itself.
Continual Learning with Pre-Trained Models: A Survey.	This work provides an extensive overview of the most recent developments in continuous learning, which is centered on continually adjusting to new information while preserving prior understanding.
MacGNN.	The MAcro Recommendation Graph (MAG) and Macro Graph Neural Networks (MacGNN) are introduced in this research. These methods greatly reduce the number of nodes by assembling similar behavior patterns into macro nodes, which addresses the computational difficulty of Graph Neural Networks.
Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates.	Our framework can support permitting, policy design, and use of machine learning in regulatory implementation problems.
Weaver: Foundation Models for Creative Writing.	A group of models called Weaver have been trained especially to narrate stories. On a benchmark for storytelling, the biggest model (34B params) performs better than GPT-4.
Text Image Inpainting via Global Structure-Guided Diffusion Models.	In this study, two datasets for handwritten words and scenes are introduced, along with a benchmark. With original, damaged, and assistant photos, the new Global Structure-guided Diffusion Model (GSDM) effectively recovers clean texts by making use of text structure. Both picture quality and identification accuracy demonstrate notable gains.
Multi-granularity Correspondence Learning from Long-term Noisy Videos.	With Norton, the multi-granularity noisy correspondence problem in video-language studies is addressed, offering a novel strategy for enhancing long-term video comprehension.
GPAvatar: Generalizable and Precise Head Avatar from Image(s).	With the use of a Multi Tri-planes Attention module and a dynamic point-based expression field, GPAvatar presents a novel technique for generating 3D head avatars from photos.
MobileDiffusion: Rapid text-to-image generation on-device.	With certain architectural modifications, Google has demonstrated a latent consistency diffusion model that it trained for sub-second generation times on mobile devices.
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks.	Shared Network Pre-training (SNP) enhances the joint learning of text and video. Compared to earlier models, this approach is more effective and adaptable and incorporates a novel technique called Significant Semantic Strengthening (S3) to improve comprehension of important terms in sentences.
Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation.	An improved version of the Segment Anything Model (SAM) with a focus on hierarchical text segmentation is called Hi-SAM. Hi-SAM is an excellent text segmenter at several levels, ranging from strokes to paragraphs, and it can even analyze layouts.

News

Link	description
Voltron Data acquires Claypot to unlock real-time AI with modular data systems.	Today, San Francisco-based Voltron Data, a startup providing enterprises with a modular and composable approach to building systems for data analytics, confirmed to VentureBeat that is acquiring real-time AI platform Claypot. The terms of the deal were not disclosed.
FTC investigating Microsoft, Amazon, and Google investments into OpenAI and Anthropic.	The commission wants to understand the tangled web of investments between cloud providers and AI startups.
Google’s New AI Is Learning to Diagnose Patients.	The DeepMind team turns to medicine with an AI model named AMIE
1/100th of the cost: CPU startup Tachyum claims that one of its processing units can rival dozens of Nvidia H200 GPUs — with a 99% saving that could turn the AI market on its head if true.	The 5nm Prodigy processor can dynamically switch between AI, HPC, and cloud workloads and costs $23,000
ChatGPT is violating Europe’s privacy laws, Italian DPA tells OpenAI.	OpenAI has been told it’s suspected of violating European Union privacy, following a multi-month investigation of its AI chatbot, ChatGPT, by Italy’s data protection authority.
This whimsical clock is the playful gadget AI needs right now.	The Poem/1 clock dreams up a new poem every minute to tell you the time. Do you need it? No. But you might want it.
iOS 17.4: Apple continues work on AI-powered Siri and Messages features, with help from ChatGPT.	Apple is widely expected to unveil major new artificial intelligence features with iOS 18 in June. Code found by 9to5Mac in the first beta of iOS 17.4 shows that Apple is continuing to work on a new version of Siri powered by large language model technology, with a little help from other sources.
Opera to launch new AI-powered browser for iOS in Europe following Apple’s DMA changes.	Opera revealed today that it will launch a new AI-powered browser built on its own engine for iOS in Europe. The Norway-based company announced the change following the news that Apple is going to allow alternative browser engines to run on iOS as a result of the requirements of the European Digital Markets Act (DMA).
Mistral CEO confirms ‘leak’ of new open source AI model nearing GPT-4 performance.	The past few days have been a wild ride for the growing open source AI community — even by its fast-moving and freewheeling standards.
Microsoft LASERs away LLM inaccuracies.	Microsoft’s LASER method seems counterintuitive, but it makes models trained on large amounts of data smaller and more accurate.
LLaVA-1.6: Improved reasoning, OCR, and world knowledge.	The most recent iteration of the visual language model Llava features enhanced reasoning, global knowledge, and OCR. It complements Gemini in some duties. The model, code, and data will be made available by the Llava team.
ServiceNow’s statement on AI.	The $150 billion market capitalization business ServiceNow revealed last week that, among all of its new product family launches, including its initial Pro SKU, its generation AI solutions generated the biggest net new ACV contribution for the first full quarter. It's exciting to see that enterprise-level AI applications are already contributing to significant revenue growth.
Bard’s latest updates: Access Gemini Pro globally and generate images.	You can now generate images in Bard in English in most countries around the world, at no cost. This new capability is powered by our updated Imagen 2 model
Amazon debuts ‘Rufus,’ an AI shopping assistant in its mobile app.	Amazon announced today the launch of an AI-powered shopping assistant it’s calling Rufus that’s been trained on the e-commerce giant’s product catalog as well as information from around the web.

Resources

Link	description
imp-v1-3b .	An additional multimodal model trained using SigLIP and Phi-2. This one is tiny enough to run on-device and provides very promising performance.
WebDataset.	WebDataset is a library for writing I/O pipelines for large datasets. Its sequential I/O and sharding features make it especially useful for streaming large-scale datasets to a DataLoader.
LLMs-from-scratch.	An unfinished yet intriguing series of exercises to teach language model building from the beginning.
Exploring ColBERT with RAGatouille.	For RAG applications, ColBERT is a great paradigm to embed queries and index data. This article runs some benchmarks and examines the method's underlying intuition.
mamba.rs.	Inspired by efforts on the Llama models, this project uses pure Rust to run inference for Mamba on the CPU.
🦙 Code Llama.	Code Llama is a code-specialized version of Llama 2 that was created by further training Llama 2 on its code-specific datasets, sampling more data from that same dataset for longer.
Eagle 7B : Soaring past Transformers with 1 Trillion Tokens Across 100+ Languages (RWKV-v5).	A brand new era for the RWKV-v5 architecture and linear transformer has arrived - with the strongest multi-lingual model in open source today
InconsistencyMasks.	A novel technique for picture segmentation called Inconsistency Masks (IM) functions even with sparse data. Tested on the ISIC 2018 dataset, our method performs better than conventional methods and even surpasses models trained on fully labeled datasets.
distortion-generator.	A novel technique for picture distortion strikes a compromise between privacy and accuracy in biometric systems, rendering facial photos incomprehensible to humans yet identifiable to AI.
TaskingAI.	TaskingAI brings Firebase's simplicity to AI-native app development. The platform enables the creation of GPTs-like multi-tenant applications using a wide range of LLMs from various providers. It features distinct, modular functions such as Inference, Retrieval, Assistant, and Tool, seamlessly integrated to enhance the development process.
100x Faster Clustering with Lilac Garden.	A difficulty in language model training is locating a sufficiently varied dataset. It is considerably more difficult to visualize this data. This useful tool facilitates data exploration to enhance filtering and overall quality through topic modeling and quick clustering.
float8_experimental.	Although less precise model training is quicker and less expensive, it is less reliable. Quantized training has been the subject of several excellent contemporary studies. Building on those foundations, this repository offers float8 teaching through readable and hackable code.
Enchanted.	Enchanted is an open-source, Ollama-compatible, elegant iOS/iPad mobile app for chatting with privately hosted models such as Llama 2, Mistral, Vicuna, Starling, and more. It's essentially ChatGPT app UI that connects to your private Ollama models. You can download Enchanted from the App Store or build yourself from scratch.
Introduction to point processing.	Whether you are doing medical image analysis or you use Photoshop, you are using point preprocessing
MF-MOS: A Motion-Focused Model for Moving Object Segmentation.	A new model called MF-MOS makes use of LiDAR technology to more effectively identify moving objects during autonomous driving. Using residual maps for motion capture and range pictures for semantic guiding distinguishes motion from semantic information in a unique way.
Mctx: MCTS-in-JAX.	Mctx is a library with a JAX-native implementation of Monte Carlo tree search (MCTS) algorithms such as AlphaZero, MuZero, and Gumbel MuZero. For computation speed up, the implementation fully supports JIT-compilation.
FireLLaVA: the first commercially permissive OSS LLaVA model.	A new open vision model called FireLlava can be used for commercial applications after it was trained on data. It performs similarly to the first Llava, but not quite as well as Llava 1.5.
uAgents: AI Agent Framework.	uAgents is a library developed by Fetch.ai that allows for the creation of autonomous AI agents in Python. With simple and expressive decorators, you can have an agent that performs various tasks on a schedule or takes action on various events.
teknium/OpenHermes-2.5.	Some of the top open models available have been trained using data from OpenHermes-2.5. More than one million high-quality data points are included in the collection. It's now available for purchase.
OLMo: Open Language Model.	A State-Of-The-Art, Truly Open LLM and Framework
BAAI/bge-m3.	A flexible embedding model that performs very well in multi-functionality (dense, multi-vector, and sparse retrieval), multi-linguistic (supporting more than 100 languages), and multi-granularity (managing inputs ranging from brief phrases to documents with up to 8192 tokens) is presented by the BGE-M3 project. It makes use of a hybrid retrieval pipeline, which leverages its simultaneous embedding and sparse retrieval capabilities, to combine several techniques and re-ranking for increased accuracy and generalization.
RAGs.	Using natural language, users can develop RAG pipelines from data sources with the help of the Streamlit app RAGs. All users need to do is specify the parameters and tasks they require from their RAG systems. You can query the RAG, and it will respond to inquiries about the information.
GPT Newspaper.	GPT Newspaper project, an innovative autonomous agent designed to create personalized newspapers tailored to user preferences. GPT Newspaper revolutionizes the way we consume news by leveraging the power of AI to curate, write, design, and edit content based on individual tastes and interests.

Perspectives

Link	description
Many AI Safety Orgs Have Tried to Criminalize Currently-Existing Open-Source AI.	Numerous teams are attempting to address the difficulties posed by the quickly developing field of artificial intelligence.
AlphaFold found thousands of possible psychedelics. Will its predictions help drug discovery?	Researchers have doubted how useful the AI protein-structure tool will be in discovering medicines — now they are learning how to deploy it effectively.
Reaching carbon neutrality requires energy-efficient training of AI.	Artificial intelligence (AI) models have achieved remarkable success, but their training requires a huge amount of energy.
What will robots think of us?	Two recent science fiction novels humorously illustrate the importance of correct robot mental models.
What Can be Done in 59 Seconds: An Opportunity (and a Crisis).	AI is already capable of completing several jobs in less than a minute, thus businesses and staff will need to stress the need to utilize AI for good rather than evil.
The American Dynamism 50: AI.	This list of 50 companies, compiled by a16z, addresses some of the most important issues facing the US in the areas of manufacturing, transportation, energy, and military. They're all utilizing AI to speed up their work in one way or another. This is an excellent insight if you're interested in practical uses of artificial intelligence.

Back to index

ML news: Week 22 - 28 January

Research

Link	description
OMG-Seg: Is One Model Good Enough For All Segmentation?.	OMG-Seg can handle over ten different segmentation tasks in one framework, including image-level and video-level segmentation tasks, interactive segmentation, and open-vocabulary segmentation. To our knowledge, this is the first model to unify these four directions.
Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation.	BriVIS, an approach that enhances open-vocabulary Video Instance Segmentation (VIS), was created by researchers. BriVIS achieves a more precise alignment between text and video by preserving the context of object motions across video frames through the use of a method known as Brownian Bridges.
Encoder-minimal and Decoder-minimal Framework for Remote Sensing Image Dehazing.	A novel framework called RSHazeNet was created to eliminate haze from remote-sensing photos. The tool makes use of cutting-edge modules to enhance image comprehension and detail preservation, improving clarity and analytical use.
Supervised Fine-tuning in turn Improves Visual Foundation Models.	Drawing inspiration from supervised fine-tuning (SFT) in natural language processing such as instruction tuning, we explore the potential of fine-grained SFT in enhancing the generation of vision foundation models after their pretraining. Thus a two-stage method ViSFT (Vision SFT) is proposed to unleash the fine-grained knowledge of vision foundation models.
Group Anything with Radiance Fields.	Hierarchical grouping in 3D by training a scale-conditioned affinity field from multi-level masks
DiverseEvol.	We introduce DiverseEvol, an efficient instruction-tuning method that allows the model itself to iteratively sample training subsets to improve its own performance, without any external supervision from humans or more advanced LLMs.
Unleashing the Power of Large-Scale Unlabeled Data.	Depth Anything is trained on 1.5M labeled images and 62M+ unlabeled images jointly, providing the most capable Monocular Depth Estimation (MDE)
Prompt Highlighter: Interactive Control for Multi-Modal LLMs.	By enabling users to highlight specific portions of prompts, researchers present the "Prompt Highlighter," a technique that transforms text production in multi-modal language models.
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer.	A novel generative model called MM-Interleaved is very good at handling and producing interleaved image-text data.
Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation.	A different preference optimization method is now being used in machine translation. For this job, it is more data-efficient than DPO. Crucially, the goal prevented the model from suggesting correct but inadequate translations, allowing it to perform competitively on WMT.
WARM: On the Benefits of Weight Averaged Reward Models.	In RLHF, reward models are employed to simulate human desire; nevertheless, the model that is being aligned frequently "hacks the reward" and performs poorly. The resultant aligned model is favored 79% of the time over one aligned with a single reward model. This is achieved by combining numerous reward models that maintain a linear mode connection. Although model merging may be merely regularization, it has shown to be an effective training phase for the general language model pipeline and has performed fairly well in general models.
Benchmarking Large Multimodal Models against Common Corruptions.	This technical study introduces MMCBench, a new benchmark created to evaluate large multimodal models' (LMMs) consistency and dependability on a variety of tasks, including text-to-image and speech-to-text. It covers more than 100 well-known models with the goal of helping readers better comprehend how various AI systems function in practical situations.
Predicting multiple conformations via sequence clustering and AlphaFold2.	AlphaFold2 has revolutionized structural biology by accurately predicting single structures of proteins. However, a protein’s biological function often depends on multiple conformational substates, and disease-causing point mutations often cause population changes within these substates
HEDNet: A Hierarchical Encoder-Decoder Network for 3D Object Detection in Point Clouds.	HEDNet is a novel encoder-decoder network that aims to improve autonomous cars' ability to recognize 3D objects by tackling the problem of sparse point distribution in 3D situations.
Prompt Pool based Class-Incremental Continual Learning for Dialog State Tracking.	This project proposes a novel prompt pool approach to recording the status of dialogs that do not need task IDs during testing, allowing it to adjust to changing user requirements.
DittoGym: Learning to Control Soft Shape-Shifting Robots.	A major problem with soft robotics is the wide control space. In this study, a simulator with a variety of tasks for handling soft objects that resemble "dittos" is introduced. It includes several powerful baselines, visualization, and utilities.
SGTR+: End-to-end Scene Graph Generation with Transformer.	A novel technique that researchers have created speeds up and improves the efficiency of the scene graph creation process. Their transformer-based approach aims to enhance the model's comprehension and interconnection of many parts in a picture, resulting in enhanced performance on complex tasks.
DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data.	Based on how similar two photographs are to one another, image similarity systems provide a score. This study builds upon earlier approaches, mainly by using artificial intelligence and human preferences.
SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation.	A model called SegMamba is intended for 3D medical image segmentation. In comparison to the Transformer architecture, it provides a more effective option.
SFC: Shared Feature Calibration in Weakly Supervised Semantic Segmentation.	To improve semantic segmentation, researchers have created the Shared Feature Calibration (SFC) technique.

News

Link	description
OpenAI’s Sam Altman Is Raising Money to Set Up AI Chip Factories.	A new report reveals that OpenAI CEO Sam Altman is gearing up to raise money to set up his own network of AI chip factories.
Google DeepMind scientists in talks to leave and form AI startup.	A pair of scientists at Google's artificial intelligence subsidiary DeepMind is in talks with investors to form an AI startup in Paris, Bloomberg News reported on Friday, citing people familiar with the conversations.
The AI phones are coming.	We’re tired of tapping through apps on our phones all day. Can Samsung show us an AI tool to save us?
How Microsoft found a potential new battery material using AI.	Advances in AI and high-performance computing are changing the way scientists look for new battery materials.
Google will pitch Bard Advanced as providing ‘complex, better responses’.	At the start of December, Google said Gemini Ultra would launch in early 2024 and be available in “Bard Advanced.” When it launches, Google will position Bard Advanced as providing “complex, better responses.”
Stability AI unveils smaller, more efficient 1.6B language model as part of ongoing innovation.	Stability AI, the vendor that is perhaps best known for its stable diffusion text to image generative AI technology, today released one of its smallest models yet, with the debut of Stable LM 2 1.6B.
Tesla finally releases FSD v12, its last hope for self-driving.	Tesla has finally started releasing its FSD Beta v12 update to customers, which is sort of its last hope to deliver on its self-driving promises.
Code LoRA From Scratch.	LoRA, which stands for Low-Rank Adaptation, is a popular technique to finetune LLMs more efficiently. Instead of adjusting all the parameters of a deep neural network, LoRA focuses on updating only a small set of low-rank matrices. This Studio explains how LoRA works by coding it from scratch, which is an excellent exercise for looking under the hood of an algorithm.
Microsoft’s Nadella Wants Stability at OpenAI, Not Control.	In the midst of regulatory reviews in the EU and the UK, Microsoft CEO Satya Nadella is happy with the current condition of Microsoft's cooperation with OpenAI, emphasizing stability above control. He highlights both Microsoft's substantial funding in OpenAI and their own autonomous AI research.
ElevenLabs Releases New Voice AI Products and Raises $80M Series B.	To strengthen its position in voice AI research and product development
Google Chrome gains AI features, including a writing helper, theme creator, and tab organizer.	Google’s Chrome web browser is getting an infusion of AI technology in the latest release. The company announced today it’s soon adding a trio of new AI-powered features to Chrome for Mac and Windows, including a way to smartly organize your tabs, customize your theme, and get help when writing things on the web — like forum posts, online reviews, and more.
Anthropic researchers find that AI models can be trained to deceive.	Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems — and terrifyingly, they’re exceptionally good at it.
Google shows off Lumiere, a space-time diffusion model for realistic AI videos .	Lumiere, a space-time diffusion model proposed by researchers from Google, Weizmann Institute of Science and Tel Aviv University to help with realistic video generation.
Adept Fuyu-Heavy: A new multimodal model.	Adept Fuyu-Heavy is a new multimodal model designed specifically for digital agents. In particular, Fuyu-Heavy scores higher on the MMMU benchmark than even Gemini Pro.
Report: Apple Making ‘Significant’ Push to Bring AI to iPhones.	Apple is reportedly making a major push to bring artificial intelligence (AI) to the iPhone.
Hugging Face and Google partner for open AI collaboration.	Today, we are thrilled to announce our strategic partnership with Google Cloud to democratize good machine learning. We will collaborate with Google across open science, open source, cloud, and hardware to enable companies to build their own AI with the latest open models from Hugging Face and the latest cloud and hardware features from Google Cloud.
OpenAI's New embedding models and API updates.	We are launching a new generation of embedding models, new GPT-4 Turbo and moderation models, new API usage management tools, and soon, lower pricing on GPT-3.5 Turbo.
Announcing Qdrant's $28M Series A Funding Round.	The firm behind the vector database, which powers some of ChatGPT and X's "More like this," has secured funds to enhance its corporate solutions and extend its Rust-based vector store.

Resources

Link	description
nanotron.	The objective of this library is to provide easily distributed primitives in order to train a variety of models efficiently using 3D parallelism.
DataTrove.	DataTrove is a library to process, filter, and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.
CaptionIMG.	A Simple program is written in Python to manually caption your images (or any other file types) so you can use them for AI training. I use it for Dreambooth training (StableDiffusion).
AI Toolkit.	AI Toolkit is a header-only C++ library that provides tools for building the brain of your game's NPCs.
Face Mixer Diffusion.	This piece demonstrates how to clone faces in photos using diffusion. Although there are other methods for creating deep fakes, diffusion is intriguing since it allows for the necessary inpainting of other image elements.
Self-Rewarding Language Model.	Implementation of the training framework proposed in the Self-Rewarding Language Model, from MetaAI
snorkelai/Snorkel-Mistral-PairRM-DPO.	A powerful new Mistral tune that creates a DPO-compatible dataset by cleverly using poor supervision and synthetic data. Numerous iterations of the described procedure can be used for a broad range of corporate use cases.
nanoColBERT.	ColBERT is a powerful late-interaction model that can perform both retrieval and reranking.
RPG-DiffusionMaster.	RPG is a powerful training-free paradigm that can utilize proprietary MLLMs (e.g., GPT-4, Gemini-Pro) or open-source local MLLMs (e.g., miniGPT-4) as the prompt reception and region planner with our complementary regional diffusion to achieve SOTA text-to-image generation and editing. Our framework is very flexible and can generalize to arbitrary MLLM architectures and diffusion backbones.
Matrix Multiplication: Optimizing the code from 6 hours to 1 sec.	A brief read about matrix multiplication optimizations particular to certain hardware and a generic procedure to accelerate AI programs.
SyncTalk: Mastering Realism in Talking Head Videos.	A significant advancement in realistic talking head videos is SyncTalk. It solves earlier problems with lip motions, expressions, and facial identity synchronization.
Hallucination Leaderboard.	Public LLM leaderboard computed using Vectara's Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document. We plan to update this regularly as our model and the LLMs get updated over time.
Embedding English Wikipedia in under 15 minutes.	Modal provides a serverless solution for organizations grappling with scaling workloads. Modal’s technology enables rapid scaling across many GPUs, which we can use to run large-scale workloads, such as generating embeddings for a massive text dataset, at lightning speed.
Concrete Steps to Get Started in Transformer Mechanistic Interpretability.	Among the founders of Mechanistic Interpretability (MI) is Neel Nanda. This serves as his entry guide into the industry. It has two hundred specific open-ended questions. The research of language models' quantitative values, or MI, involves actually examining neurons. Even though there hasn't been much progress in this area of study yet, it is accessible because it doesn't demand a lot of processing power.
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation.	SDD contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in the evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation, and music-language retrieval.
DiffMoog: A Modular Differentiable Commercial-like Synthesizer.	This repo contains the implementation of DiffMoog, a differential, subtractive, modular synthesizer, incorporating standard architecture and sound modules commonly found in commercial synthesizers.
TensorDict.	TensorDict is a dictionary-like class that inherits properties from tensors, such as indexing, shape operations, casting to device or point-to-point communication in distributed settings. The main purpose of TensorDict is to make code bases more readable and modular by abstracting away tailored operations
Evaluation Metrics for LLM Applications In Production.	How to measure the performance of LLM applications without ground truth data.
Asynchronous Local-SGD Training for Language Modeling.	This repository contains a Colab notebook that presents a minimal toy example replicating the observed optimization challenge in asynchronous Local-SGD. The task is to perform classification on a mixture of mixtures of Gaussian data.
SpeechGPT: Speech Large Language Models.	A novel speech synthesis model called SpeechGPT-Gen effectively manages the intricacies of language and voice traits.
LLM Steer.	A Python module to steer LLM responses towards a certain topic/subject and to enhance capabilities (e.g., making it provide correct responses to tricky logical puzzles more often). A practical tool for using activation engineering by adding steer vectors to different layers of a Large Language Model (LLM). It should be used along with the Transformers library.
RoMa: A lightweight library to deal with 3D rotations in PyTorch..	RoMa (which stands for Rotation Manipulation) provides differentiable mappings between 3D rotation representations, mappings from Euclidean to rotation space, and various utilities related to rotations. It is implemented in PyTorch and aims to be an easy-to-use and reasonably efficient toolbox for Machine Learning and gradient-based optimization.
AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agent.	AgentBoard is a benchmark designed for multi-turn LLM agents, complemented by an analytical evaluation board for detailed model assessment beyond final success rates. The main Performance of different LLMs across various environments are shown below, please check our Results for more details.
makeMoE: Implement a Sparse Mixture of Experts Language Model from Scratch.	This blog walks through implementing a sparse mixture of experts language model from scratch. This is inspired by and largely based on Andrej Karpathy's project 'makemore' and borrows a number of reusable components from that implementation.

Perspectives

Link	description
Text-to-Video: The Task, Challenges and the Current State.	Text-to-video is next in line in the long list of incredible advances in generative models. How do these models work, how do they differ from text-to-image models, and what kind of performance can we expect from them?
My AI Timelines Have Sped Up (Again).	In light of developments in scaling up models, the author updated their forecasts for the AI timetable. As of right now, they predict that artificial general intelligence will be achieved with a 10% probability by 2028 and a 50% likelihood by 2045. The efficacy of massive language models and the knowledge that numerous intelligent capabilities may arise at scale are credited with these changes.
Should The Future Be Human?.	Elon Musk and Larry Page have a deep disagreement over the possible risks associated with artificial intelligence. Page has called Musk a "speciesist" for favoring humans over digital life forms, which has caused a gap in their friendship. This demonstrates the necessity for careful and deliberate development of AI technology and reflects the larger discussion on the influence of AI, which includes worries about consciousness, individuation, art, science, philosophy, and the potential for mergers between humans and AI.
Computers make mistakes and AI will make things worse — the law must recognize that.	A tragic scandal at the UK Post Office highlights the need for legal change, especially as organizations embrace artificial intelligence to enhance decision-making.
Google AI has better bedside manner than human doctors — and makes better diagnoses.	Researchers say their artificial intelligence system could help to democratize medicine.
Tech developers must respect equitable AI access.	We argue for a legal framework to ensure equitable access to artificial intelligence (AI) tools, such as ChatGPT, to avoid limiting their benefits to a privileged few
Seven technologies to watch in 2024.	Advances in artificial intelligence are at the heart of many of this year’s most exciting areas of technological innovation
If AI Were Conscious, How Would We Know?.	When discussing AI consciousness, references to Searle's Chinese Room Thought Experiment and the Turing Test are frequently made. The former examines whether an AI's conduct can be distinguished from that of a human, while the latter contends that exterior behavior is insufficient to demonstrate consciousness. Given that our knowledge of consciousness in AI is mostly derived from functionalist theories and human experiences, this argument emphasizes how difficult it is to define and identify consciousness in AI.
AI today and trends for an AI future.	A survey of experts on: How are early adopters using AI today? Where is AI going in 2024?

Back to index

ML news: Week 15 - 21 January

Research

Link	description
I am a Strange Dataset: Metalinguistic Tests for Language Models.	An example of a self-referential challenge phrase is "the last word in this sentence is." This kind of phrase is extremely difficult for language models to handle. This work presents a dataset and some assessments aimed at enhancing the metalinguistic capabilities of language models.
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models.	A complementary line of inquiry to the well-known Stable Diffusion collection of picture generating models has been PixArt. With the use of ControlNet-style prompting and latent consistency models, this work improves control and speeds up creation.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.	Anthropic has published some intriguing research in which a sleeper phrase designed to induce a particular response is used to deliberately poison a language model. It discovered that this kind of model could not be "aligned" with the robust system that it utilized for its production models. In other words, once the model was poisoned, negative behavior could not be undone with the resources available today.
PALP: Prompt Aligned Personalization of Text-to-Image Models.	Right now, Dreambooth is the most effective way to customize an image model. Prompt alignment is composable and significantly increases adherence to the prompt.
INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning.	We introduce a novel instruction tuning dataset, INTERS, encompassing 21 tasks across three fundamental IR categories: query understanding, document understanding, and query-document relationship understanding. The data are derived from 43 distinct datasets with manually written templates.
Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach.	A new dataset called INTERS has been created by researchers with the goal of enhancing the performance of big language models such as Mistral and LLaMA in information retrieval tasks.
HiCMAE.	A revolutionary self-supervised learning framework called HiCMAE was created to improve AVER or Audio-Visual Emotion Recognition. This method leverages large-scale pre-training on unlabeled audio-visual data to get over data scarcity issues.
Language Enhanced Multi-modal Grounding Model.	A novel end-to-end multimodal grounding model called LEGO exhibits sophisticated comprehension and grounding skills across several modalities, including pictures, sounds, and videos.
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks.	Challenging data has long been assumed to be necessary to solve challenging issues, yet this data is noisy and difficult to identify. This work demonstrates that models may be made far more capable of generating solutions to difficult situations by fine-tuning them on related but easy data. A further piece of evidence to back up fine-tuning is that it elicits information rather than imparts it.
Mutual Distillation Learning For Person Re-Identification.	By merging two distinct approaches, researchers have created a revolutionary method called Mutual Distillation Learning For Person Re-identification (MDPR) that improves person re-identification.
Large language models help computer programs to evolve.	A branch of computer science known as genetic programming has been given a boost with the application of large language models that are trained on the combined intuition of the world’s programmers. comment here.
Solving olympiad geometry without human demonstrations.	Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning Blog post from DeepMind.
Fast and Expressive LLM Inference with RadixAttention and SGLang.	Two new advances for language model inference have been provided by LMSYS. The first is a backend tweak that raises the performance of tokens per second overall. Prompting parallelism is possible with the second prompting approach, which is an embedded language tailored to a particular domain.
Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities.	The difficulty of creating Vision Foundation Models (VFMs) especially for autonomous driving is examined in this research. It offers insights into pre-training, task adaptability, and data preparation in AI by examining more than 250 research articles, showcasing state-of-the-art methods such as 3D Gaussian Splatting and NeRF.
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models.	By concentrating on video tasks, DoraemonGPT, a novel artificial intelligence system built on huge language models, advances our comprehension of dynamic real-world events. For effective spatial-temporal querying, it transforms films into a symbolic memory. It also includes specialized tools and an innovative planner for handling challenging tasks.
Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering.	AlphaCodium presents a new method to improve LLMs' code creation. As evidenced by the CodeContests dataset, this multi-stage, test-based iterative procedure greatly increases the accuracy of models such as GPT-4 in tackling complicated programming tasks.
Foundations of Vector Retrieval.	Almost all of the information one may want to know about the current status of the vector retrieval area is covered in this enormous document. It will take some time to go through this important resource.
Learning to Follow Object-Centric Image Editing Instructions Faithfully.	This study addresses issues such as ambiguous instructions and selectively selecting regions of the image to modify, hence enhancing the quality of photographs modified with natural language instructions.

News

Link	description
OpenAI changes policy to allow military applications.	In an unannounced update to its usage policy, OpenAI has opened the door to military applications of its technologies.
Using AI, MIT researchers identify a new class of antibiotic candidates.	These compounds can kill methicillin-resistant Staphylococcus aureus (MRSA), a bacterium that causes deadly infections.
Microsoft wants to automatically launch its Copilot AI on some Windows 11 devices.	You might see Copilot start automatically opening on Windows 11 soon, but only with certain display situations.
Microsoft launches Copilot Pro for $20 per month per user.	Copilot Pro gives you the latest features and best models that Microsoft AI has to offer.
How OpenAI is approaching 2024 worldwide elections.	We’re working to prevent abuse, provide transparency on AI-generated content, and improve access to accurate voting information.
Sakana AI raises $30m seed.	In Tokyo, Sakana.ai is constructing a state-of-the-art research facility to create foundation models that are more compact and effective. David Ha and Llion Jones, two former Google researchers who are credited with innovations including Transformers, World Models, and LoRA, formed the business. To lead this initiative and establish Tokyo as a leader in AI, it has received a $30 million seed round from Jon Chu at Khosla Ventures and Brandon Reeves at Lux Capital.
Stable Code 3B: Coding on the Edge.	Stable Code 3B is a 3 billion parameter Large Language Model (LLM), allowing accurate and responsive code completion at a level on par with models such as CodeLLaMA 7b that are 2.5x larger.
OpenAI announces team to build ‘crowdsourced’ governance ideas into its models.	OpenAI says it wants to implement ideas from the public about how to ensure its future AI models “align to the values of humanity.”
OpenAI must defend ChatGPT fabrications after failing to defeat libel suit.	ChatGPT users may soon learn whether false outputs will be allowed to ruin lives.
Samsung’s S24 and S24 Plus put new AI smarts in a polished package.	The two smaller siblings of the Galaxy S24 Ultra are very similar-looking phones to last year’s devices, but they include new AI-powered features and a promise of seven years of software and security updates.
OpenAI announces first partnership with a university.	OpenAI on Thursday announced its first partnership with a higher education institution.
Mark Zuckerberg’s new goal is creating artificial general intelligence.	And he wants Meta to open source it. Eventually. Maybe.
8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2.	8bit in neural networks is not a new concept. However, shipping 8-bit models in the real world on a large scale is challenging.
Microsoft makes its AI-powered reading tutor free.	Microsoft today made Reading Coach, its AI-powered tool that provides learners with personalized reading practice, available at no cost to anyone with a Microsoft account.
Ousted Twitter CEO Parag Agrawal is back with an AI startup; gets $30 mn in funding led by Khosla Ventures.	Agrawal is back with an artificial intelligence (AI) startup that has already raised $30 million in funding that is led by Khosla Ventures.

Resources

Link	description
Moore-AnimateAnyone.	AnimateAnyone is a fantastic video control model that animates the person in the control image by using skeletal motion and an image as input. This code replicates that work in an open manner.
surya.	Surya is a multilingual document OCR toolkit
David Attenborough narrates your life.	Using a combination of GPT4-V, top-of-the-line text-to-speech, and some computer capture software, you can have someone like David Attenborough narrate everything that is happening in your life.
Create translations that follow your speech style.	Meta has a new demo for seamless voice cloning and translation between languages. SeamlessExpressive is an AI model that aims to maintain expressive speech style elements in the translation
Vanna.	Vanna is an MIT-licensed open-source Python RAG (Retrieval-Augmented Generation) framework for SQL generation and related functionality.
GRDBIS.	Graph Relation Distillation for Efficient Biomedical Instance Segmentation
AQLM.	Official PyTorch implementation for Extreme Compression of Large Language Models via Additive Quantization
RotationDrag.	RotationDrag: Point-based Image Editing with Rotated Diffusion Features
AutoGGUF.	GGUF is a format that allows many quantization methods and is used to run models with llama cpp. The quantization is automated by this notebook; it may not be effective for all models, but it is for the majority.
Listening with LLM.	consolidate learnings on how to finetune Large Language Models (LLMs) to process audio, with the eventual goal of being able to build and host a LLM able to describe human voices.
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding.	Generating customized, styled images is one of the most popular applications of generative picture models. Previously, DreamBooth or LoRA training was needed for this. Now, with just one picture and ID embeddings, you may significantly increase quality while lowering computing costs.
Content Consistent Super-Resolution.	Improving the Stability of Diffusion Models for Content Consistent Super-Resolution
FilCo.	This repository contains the code and data about the project: Learning to Filter Context for Retrieval-Augmented Generation
haiku_dpo .	Dataset to help align models to write correct Haiku’s.
sanity-checks-revisited.	This repository contains the code and experiments for the paper Sanity Checks Revisited: An Exploration to Repair the Model Parameter Randomisation Test
MAGNeT.	Masked Audio Generation using a Single Non-Autoregressive Transformer
Tiny Narrations.	A text-to-speech read variant of the well-known (and compact) Tiny Stories dataset is called Tiny Narrations. On the SF Compute H100 cluster, it makes use of XTTS2.
Interconnects Tools for Multimodal Blogging!.	Python tools for easily translating your blog content to podcasts & YouTube
ALMA: Advanced Language Model-based translator.	ALMA (Advanced Language Model-based TrAnslator) is a many-to-many LLM-based translation model, which adopts a new translation model paradigm: it begins with fine-tuning monolingual data and is further optimized using high-quality parallel data. This two-step fine-tuning process ensures strong translation performance.
Privy.	A privacy-first coding assistant.
UV-SAM: Adapting Segment Anything Model for Urban Village Identification.	This work presents UV-SAM, a modified version of the Segment Anything Model, and the Vision Foundation Model that may be used to precisely locate urban village borders on satellite imagery. UV-SAM provides an effective substitute for conventional field surveys by integrating various image representations to achieve accurate detection.
ml-aim.	We introduce AIM a collection of vision models pre-trained with an autoregressive generative objective.
compose-and-conquer.	Official implementation of Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis. Excell for placing objects in three-dimensional space
Vlogger.	we present Vlogger, a generic AI system for generating a minute-level video blog (i.e., vlog) of user descriptions. Different from short videos with a few seconds, vlog often contains a complex storyline with diversified scenes, which is challenging for most existing video generation approaches.
trapped-in-texture-bias.	This is the official code release for the paper Trapped in texture bias. A large-scale comparison of deep instance segmentation
MegaDolphin-120b.	MegaDolphin-2.2-120b is a transformation of Dolphin-2.2-70b
TACO(Topics in Algorithmic COde generation dataset).	TACO (Topics in Algorithmic COde generation dataset) is a dataset focused on algorithmic code generation, designed to provide a more challenging training dataset and evaluation benchmark for the code generation model field.
AlphaFold found thousands of possible psychedelics. Will its predictions help drug discovery?	Researchers have doubted how useful the AI protein-structure tool will be in discovering medicines — now they are learning how to deploy it effectively.

Perspectives

Link	description
The Case for Cyborgs.	Augmenting human intelligence beyond AI will take us much further than creating something new
Past, Present, and Future of AI with Vijay Pande.	A forty-minute contemplation about AI featuring an outlook for the future.
AI Will Transform the Global Economy. Let’s Make Sure It Benefits Humanity.	AI will affect almost 40 percent of jobs around the world, replacing some and complementing others. We need a careful balance of policies to tap its potential
AI is Not the Solution to All Our Educational Challenges.	Empowering Students with an Immersive Mindset for Navigating an Unpredictable World
The Lazy Tyranny of the Wait Calculation.	The "Wait Calculation" idea suggests holding off on undertaking certain tasks or going on a space mission to Barnard's Star until the technology has advanced enough to save a considerable amount of time and effort. This strategy must be weighed against the unpredictable nature of technological advancement and the possibility of learning losses.
What counts as plagiarism? Harvard president’s resignation sparks debate.	Allegations against Claudine Gay have left researchers arguing over academic standards and practices.
‘Set it and forget it’: automated lab uses AI and robotics to improve proteins.	A self-driving lab system spent half a year engineering enzymes to work at higher temperatures.
The consciousness wars: can scientists ever agree on how the mind works?.	There are dozens of theories of how the brain produces conscious experience, and a new type of study is testing some of them head-to-head.
Centres of Excellence in AI for global health equity — a strategic vision for LMICs.	We propose that Centres of Excellence should be established in low- and middle-income countries (LMICs) to enable artificial intelligence (AI) to deliver equity in health care.
Does generative AI help academics to do more or less?.	UK academics use generative artificial intelligence (AI) in their work mainly because it improves task efficiency, saves time and labor, and boosts competitiveness
Evaluations Are All We Need.	This essay examines the difficulties in assessing LLMs and contrasts them with assessments of employees conducted by humans. It addresses the challenge of gauging the practicality and intelligence of LLMs, emphasizing the shortcomings of existing assessment techniques and the demand for more efficient ones.
The Road To Honest AI.	Identifying and modifying honesty-related vectors within the AI or employing unrelated questions to discover lying tendencies based on the AI's response consistency are two strategies suggested by recent studies to regulate AI honesty.

Back to index

ML news: Week 8 - 14 January

Research

Link	description
GUESS:GradUally Enriching SyntheSis for Text-Driven Human Motion Generation.	A human motion from a text framework named GUESS has been introduced. It reduces intricate human stances to more abstract forms on several levels, resulting in a more steady and condensed synthesis of motion from text.
Learning to Prompt with Text Only Supervision for Vision-Language Models.	This project presents a technique to keep the generalization capabilities of CLIP-like vision-language models while adapting them for different tasks. Prompts are learned from LLM data, so labeled images are not necessary.
LLaVA-ϕ: Efficient Multi-Modal Assistant with Small Language Model.	In this paper, we introduce LLaVA-ϕ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model, Phi-2, to facilitate multi-modal dialogues.
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs.	We introduce V*, an LLM-guided visual search mechanism that employs the world knowledge in LLMs for efficient visual querying. When combined with an MLLM, this mechanism enhances collaborative reasoning, contextual understanding, and precise targeting of specific visual elements.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism.	The DeepSeek LLM was one of the greatest coding models available last year. In several benchmarks, it achieved closeness to GPT-3.5 (despite being probably three times larger). A technical study has been made public with details on model training, token counts, model architecture, and other topics.
Denoising Vision Transformers.	The vision community has been overtaken by Vision Transformers (ViT). They occasionally still exhibit artifacts in their embeddings that resemble grids. The community is reluctant to use them for jobs that come after because of this. This study suggests a positional embedding update that fixes this problem and provides a 25%+ performance gain for downstream vision tasks.
FED-NeRF: Achieve High 3D Consistency and Temporal Coherence for Face Video Editing on Dynamic NeRF.	A new stabilizer for smooth temporal coherence and GAN-NeRF technology for 3D consistency have been used by researchers to create a facial video editing architecture. This technique works well for editing videos since it keeps viewpoints constant and makes frame transitions smooth.
A Minimaximalist Approach to Reinforcement Learning from Human Feedback.	Self-Play Preference Optimization (SPO), a less complex alignment method than conventional RLHF, has been presented by Google researchers. Using game theory, the researchers were able to develop single-player self-play dynamics that provide good performance and are resilient to noisy preferences.
Mixtral of Experts.	We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts).
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation.	The constraints of existing single-criterion measures have been addressed by researchers with the development of a new assessment metric for text-to-3D generative models. This sophisticated technique compares 3D objects and generates prompts using GPT-4V. It is very compatible with human tastes and provides flexibility by adjusting to different user-specified requirements.
Self-emerging Token Labeling.	Using a novel self-emerging token labeling (STL) framework, researchers have made a substantial development for Vision Transformers (ViTs) by improving the resilience of the Fully Attentional Network (FAN) models. Using this method, a FAN student model is trained after a FAN token labeler has been trained to produce relevant patch token labels.
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning.	We propose a Multi-disciplinary Collaboration (MC) framework. The framework works in five stages: (i) expert gathering: gathering experts from distinct disciplines according to the clinical question; (ii) analysis proposition: domain experts put forward their own analysis with their expertise; (iii) report summarization: compose a summarized report on the basis of a previous series of analyses; (iv) collaborative consultation: engage the experts in discussions over the summarized report. The report will be revised iteratively until an agreement from all the experts is reached; (v) decision making: derive a final decision from the unanimous report.
DiffBody: Diffusion-based Pose and Shape Editing of Human Images.	This study presents a one-shot approach to human image editing that allows for substantial body form and position modifications without compromising the subject's identification.
LLaMA Beyond English: An Empirical Study on Language Capability Transfer.	Our evaluation results demonstrate that comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality.
Masked Audio Generation using a Single Non-Autoregressive Transformer.	Most audio creation methods produce sounds by diffusion or an auto-regressive model. This study does not employ a complex Transformer or several stages. Rather, it employs an obscured language model on top of audio tokens.
TechGPT-2.0: A large language model project to solve the task of knowledge graph construction.	TechGPT-2.0 improves on big language models for particular applications, such as building knowledge graphs. With its emphasis on relationship triple extraction and named entity identification, the project also represents a major advancement for the Chinese open-source AI community.
Long-Context Retrieval Models with Monarch Mixer.	Compute has been investigating a variety of substitutes for Transformers. It has published a retrieval model for retrieval tasks that performs better than a lot of closed embedding models.
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting.	It is shown that, depending on the prompt, prompting models for a small set of shot benchmarks can provide task accuracy ranging from 4 to 88%. This study demonstrates how to enhance your prompts in a scientific way.
Application of Deep Learning in Blind Motion Deblurring: Current Status and Future Prospects.	An extensive review of deep learning's application to blind motion deblurring—a crucial field in computer vision—is provided in this work. It covers everything from fundamental ideas and the drawbacks of conventional approaches to a thorough analysis of contemporary strategies including CNNs, GANs, RNNs, and Transformers.
Singer Identity Representation Learning using Self-Supervised Techniques.	A new framework has been created by researchers to examine and comprehend singing voices more thoroughly. By applying self-supervised learning on isolated vocal records and focusing on out-of-domain generalization, they achieved progress in tasks like singing voice similarity and synthesis, improving upon current technology.
Towards the Law of Capacity Gap in Distilling Language Models.	Language model (LM) distillation is a trending area that aims to distill the knowledge residing in a large teacher LM to a small student one. The law later guides us to distill a 3B student LM (termed MiniMA) from a 7B teacher LM (adapted LLaMA2-7B).

News

Link	description
Nabla raises another $24 million for its AI assistant for doctors that automatically writes clinical notes.	Paris-based startup Nabla just announced that it had raised a $24 million Series B funding round led by Cathay Innovation
OpenInterpreter gets an OS mode.	An excellent effort that mimics OpenAI's interpreter is called Open Interpreter. It can now operate your computer using a language model by pressing buttons and seeing the screen since it has both an OS mode and a visual mode.
Wave of Apple Generative AI Tools on Track for WWDC Debut.	Apple is on schedule to announce a series of generative AI-based tools at its Worldwide Developers Conference (WWDC) in June, Bloomberg's Mark Gurman reports.
A survey of 2,778 researchers shows how fragmented the AI science community is.	The "2023 Expert Survey on Progress in AI" shows that the scientific community has no consensus on the risks and opportunities of AI, but everything is moving faster than once thought.
Microsoft’s observer has reportedly joined the OpenAI board.	Now Bloomberg reports that person is Microsoft vp Dee Templeton, who has been there for 25 years and leads a team responsible for managing its relationship with OpenAI.
Microsoft, OpenAI sued for copyright infringement by nonfiction book authors in class action claim.	Two nonfiction book authors sued Microsoft and OpenAI in a putative class action complaint alleging that the defendants “simply stole” the writers’ copyrighted works to help build a billion-dollar artificial intelligence system.
OpenAI and journalism.	In response to The New York Times lawsuit, OpenAI emphasized working with news organizations, asserted that using public content for AI training is fair use, pledged to stop using rare content repeatedly in their models, and expressed surprise at the lawsuit considering their continuous efforts to address issues.
Getty and Nvidia bring generative AI to stock photos.	Generative AI by iStock lets users make their own stock photos from text prompts.
Microsoft’s new Copilot key is the first big change to Windows keyboards in 30 years.	Microsoft wants 2024 to be the year of the AI PC as it lines up bigger changes to Windows.
Rabbit foundation model and computer.	The large action model (LAM) developed by Rabbit was designed to work with the R1 pocket companion computer. Almost fully driven by its LAM, the company's R1 gadget is a reimagining of the computer and smartphone.
OpenAI’s news publisher deals reportedly top out at $5 million a year.	The ChatGPT company has been trying to get more news organizations to sign licensing deals to train AI models.
Intel: ‘We are bringing the AI PC to the car’.	The chip company is doubling down on its auto business, introducing a new AI-enhanced system-on-a-chip for cars. The first company to install it will be Zeekr.
AlphaFold’s Latest Strides: Improved Accuracy for Antibody-Antigen Complex Modeling.	A new study from the University of Maryland evaluates its accuracy and provides new insights into the factors influencing protein modeling.
Introducing the GPT Store.	OpenAI has launched the GPT store, which allows developers to get paid by building these agents. The company plans to feature GPTs every week.
Regulators aren’t convinced that Microsoft and OpenAI operate independently.	EU is fielding comments on potential market harms of Microsoft's investments.
Your private AI can have eyes. Ollama with the LLaVA model.	Vision models are now supported by Ollama. With Llava, you may enjoy cutting-edge language and vision performance on your MacBook Pro.
OpenAI debuts ChatGPT subscription aimed at small teams.	OpenAI is launching a new subscription plan for ChatGPT, its viral AI-powered chatbot, aimed at smaller, self-service-oriented teams.
Valve now allows the “vast majority” of AI-powered games on Steam.	New reporting system will enforce "guardrails" for "live-generated" AI content.
marc newson designs swarovski's world-first AI binoculars that identify species on their own.	the dubbed world’s first AI-supported binoculars that using their high-performance analog long-range optics and digital intelligence, they can detect and identify more than 9,000 birds and other wildlife at a touch of a button.
Google Cloud launches new generative AI tools for retailers.	Google launched several new AI tools for retailers to improve online shopping experiences and other retail operations.
Amazon’s Alexa gets new generative AI-powered experiences.	Today, the company revealed three developers delivering new generative AI-powered Alexa experiences, including AI chatbot platform Character.AI, AI music company Splash and Voice AI game developer Volley. All three experiences are available in the Amazon Alexa Skill Store.

Resources

Link	description
Steering Llama-2 with contrastive activation additions.	By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised fine-tuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that we can get all three benefits at once!
DiffusionEdge.	DiffusionEdge is an innovative edge detection model that works better than current techniques. Through the integration of a diffusion probabilistic model, DiffusionEdge produces resource-efficient edge maps that are more precise and clean.
Transformers From Scratch.	In this blog we’re going to walk through creating and training a transformer from scratch. We’ll go through each foundational element step by step and explain what is happening along the way.
Merge Large Language Models with mergekit.	Model merging is a technique that combines two or more LLMs into a single model. It’s a relatively new and experimental method to create new models for cheap (no GPU required). Model merging works surprisingly well and produced many state-of-the-art models on the Open LLM Leaderboard. In this tutorial, we will implement it using the mergekit library.
Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory.	This book aims to provide an introduction to the topic of deep learning algorithms. We review essential components of deep learning algorithms in full mathematical detail including different artificial neural network (ANN) architectures and different optimization algorithms
Portkey's AI Gateway.	is the interface between your app and hosted LLMs. It streamlines API requests to OpenAI, Anthropic, Mistral, LLama2, Anyscale, Google Gemini and more with a unified API.
act-plus-plus.	Imitation Learning algorithms and Co-training for Mobile ALOHA
crewAI.	Cutting-edge framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.
Integrating CLIP and SAM for Enhanced Image Segmentation.	In order to enhance picture segmentation and identification, this research presents the Open-Vocabulary SAM, a framework that combines the advantages of CLIP and SAM models.
Diffusion Models for Reinforcement Learning: A Survey.	Diffusion models' contribution to RL. Their applications are categorized in this repository, which also provides links to upcoming interdisciplinary research opportunities.
tinygrad.	A very simple implementation of inference of the new Mistral MoE model using the Tinygrad library.
YouTube Transcripts → Knowledge Graphs for RAG Applications.	how to scrape YouTube video transcripts into a knowledge graph for Retrieval Augmented Generation (RAG) applications.
AI Toolkit.	AI Toolkit is a header-only C++ library that provides tools for building the brain of your game's NPCs.
SpeechAgents.	SpeechAgents is a multi-modal artificial intelligence system that can very realistically mimic human speech. With the use of a multi-modal LLM, this system can manage up to 25 agents. Its ability to imitate human language, complete with constant substance, realistic rhythms, and emotive emotions, suggests that it has promise for use in plays and audiobooks.
Model Card for Switch Transformers C - 2048 experts (1.6T parameters for 3.1 TB).	Google's switch transformer was among the first Mixture-of-Experts models to achieve success. It can now be found on the HuggingFace platform with code.
Make LLM Fine-tuning 2x faster with Unsloth and 🤗 TRL.	Pulling your hair out because LLM fine-tuning is taking forever? In this post, we introduce a lightweight tool developed by the community to make LLM fine-tuning go super fast!
distilabel Orca Pairs for DPO.	a novel technique that makes it possible to filter excellent pair preferences for alignment. It significantly raises the performance of the baseline model.
Chatbot UI.	The open-source AI chat app for everyone.
explain-then-translate.	We propose a 2-stage Chain-of-Thought (CoT) prompting technique for program translation: we ask models to explain the source programs first before translating.
WhiteRabbitNeo-33B-v1 .	Both offensive and defensive security training have been given to this model. This general-purpose coding paradigm can help with activities related to cyber security. This implies that you may use it to learn how to defend against various attacks and vulnerabilities as well as to safeguard your networks.

Perspectives

Link	description
How to Build a Thinking AI.	This article provides an analytical framework for how to simulate human-like thought processes within a computer. It describes how attention and memory should be structured, updated, and utilized to search for associative additions to the stream of thought.
The New York Times’ AI Opportunity.	In its case against OpenAI and Microsoft, the New York Times alleges that the companies' AI technologies—ChatGPT among them—were trained on millions of copyrighted articles from the newspaper, resulting in outputs that are directly competitive with the Times' services. The lawsuit challenges the legality of AI training practices and the effects of AI on traditional content creators, claiming that this amounts to copyright infringement and jeopardizes the newspaper's investment in journalism. It also demands the destruction of AI models and data that used Times content, along with billions of dollars in damages.
Does AI risk “other” the AIs?.	The idea of "othering" AIs and the moral ramifications of regulating or changing AI in the future as well as human values are the main topics of this essay's analysis of Robin Hanson's critique of the AI risk discourse. Fearing AI as an "other" is biased, according to Hanson. It's possible that Hanson's opinions undervalue the dangers of unchecked AI growth and the difficulties of bringing future AI ideals into line with human ethics.
Part One: One-Year Anniversary of ChatGPT. Has AI Become the New Tech Platform?.	The "Anatomy Framework", a tool for evaluating the disruptive potential of any breakthrough, including artificial intelligence, is introduced in this article. It examines innovation from five perspectives: apps, tools, core platform, underlying infrastructure, and ecosystem facilitators. It also covers the role of innovators, both new and established and the innovation medium (hardware vs. software).
There are holes in Europe’s AI Act — and researchers can help to fill them.	Scientists have been promised a front-row seat for the formulation of the EU’s proposed AI regulatory structures. They should seize this opportunity to bridge some big gaps.
The science events to watch for in 2024.	Advanced AI tools, Moon missions, and ultrafast supercomputers are among the developments set to shape research in the coming year.
Will superintelligent AI sneak up on us? A new study offers reassurance.	Improvements in the performance of large language models such as ChatGPT are more predictable than they seem.
AI consciousness: scientists say we urgently need answers.	Researchers call for more funding to study the boundary between conscious and unconscious systems.
AI could transform metal recycling globally.	Metal recycling needs to become more cost-efficient because it is a crucial contributor to the global circular economy and the transition to renewable energy.
Can AI make genuine theoretical discoveries?.	When Nature included ChatGPT alongside its list of ten people who helped to shape science in 2023, it seemed deliberately provocative
AI and the Future of SaaS.	Today, let’s look into the crystal ball and see a few opportunities, challenges, and threats that AI systems may pose for software entrepreneurs and creators.
Benchmarking GPT-4 Turbo - A Cautionary Tale.	GPT-4 Turbo came up slightly behind at 68.8%, while GPT-4 successfully finished 70% of the programming tasks. It's interesting to note that GPT-4 Turbo needed more tries than GPT-4, which may indicate that it lacks GPT-4's memory power. A further test supported this.
Unraveling spectral properties of kernel matrices.	This article examines the implications for learning properties of the way that eigenvalues vary for various Kernel Matrices.
NVIDIA’s CEO on Leading Through the A.I. Revolution.	In this podcast, NVIDIA CEO and co-founder Jensen Huang shares his thoughts on how he steers his company through rapidly changing times and offers advice to other entrepreneurs on how to stay competitive by incorporating AI into their operations.
It’s Humans All the Way Down.	Because everyone believes that everyone else's work is simple, people believe that AI will replace a lot of employment. Ignorance is the foundation for the desire to exclude humans from the equation. It is impossible to ignore the fact that people matter, even in the craziest of ideas. Humans want to be seen and understood by other humans.

Back to index

ML news: Week 1 - 7 January

Research

Link	description
MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining.	MosaicBERT is a custom BERT architecture optimized for fast pretraining. This study motivated many of the architecture choices around MosaicML's MPT-7B and MPT-30B models. the main architectural modifications used: FlashAttention, ALiBi, Gated Linear Units, Low Precision LayerNorm.
Improving Text Embeddings with Large Language Models.	Microsoft researchers trained a decoder-only transformer based on Mistral for embeddings using synthetic data. In the class, it is the best. Remarkably, they create the synthetic retrieval training data using GPT-4 and a two-step prompting technique.
Images altered to trick machine vision can influence humans too.	New research shows that even subtle changes to digital images, designed to confuse computer vision systems, can also affect human perception
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models.	The fact that existing language models require very expensive human preference data to function properly is one of their main disadvantages. Determining if it is possible to have language models self-play develop without gathering this data has emerged as a major area of current study. With only SFT data, a new technique called SPIN makes significant progress in that direction by significantly enhancing a basic model's performance on a variety of tasks.
Boundary Attention: Learning to Find Faint Boundaries at Any Resolution.	Identifying edges and curves in pictures is a traditional computer vision challenge. Nevertheless, many existing approaches perform poorly when noise, quality changes, or out-of-distribution instances are introduced. With just 207k parameters, this newly discovered approach works very well on sensor readings. It significantly advances state of the art and employs a two-stage training procedure.
Bracketing is All You Need: Unifying Image Restoration and Enhancement Tasks with Multi-Exposure Images.	This work uses a unique temporally modulated recurrent network (TMRNet) with bracketing photography to achieve a considerable improvement in low-light photo quality. This method surpasses current multi-image processing techniques by training with synthetic data and adapting to real-world pictures.
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation.	The Auffusion system presents a breakthrough in Text-to-Audio (TTA) creation, inspired by Text-to-Image diffusion models. It is quite good at turning text into high-quality audio, especially with complicated inputs.
Context-Aware Interaction Network for RGB-T Semantic Segmentation.	CAINet is an innovative technique that researchers have developed to improve RGB-T semantic segmentation, which is important for autonomous driving. This system mixes many data kinds in a unique way, emphasizing the complementary qualities and global context of each form of data.
3D-Aware Visual Question Answering about Parts, Poses and Occlusions.	Although there has been progress in Visual Question Answering (VQA), most models focus primarily on 2D reasoning and ignore the intricacy of 3D visual settings. This study introduces 3D-aware VQA.
DocLLM.	We present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout
GPT-4V(ision) is a Generalist Web Agent.	In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website.
Fast Inference of Mixture-of-Experts Language Models with Offloading.	With the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies for running these models more efficiently. One such strategy is to use a sparse mixture of experts (MoE). In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory.
LLM Augmented LLMs: Expanding Capabilities through Composition.	investigate combining specialized models with preexisting foundation models to increase capabilities; introduce cross-attention between models to combine representations that allow for new capabilities. For instance, a PaLM2-S model was enhanced with a smaller model trained on low-resource languages to enhance English translation and arithmetic reasoning for low-resource languages; this was also accomplished with a code-specific model that produced a 40% improvement in code generation and explanation tasks compared to the base code model.
LLaMA Pro.	provides a post-pretraining technique to enhance an LLM's knowledge without causing catastrophic forgetting; it does this by freezing the inherited blocks and tuning expanded identity blocks using only new corpus; trains an LLaMA Pro-8.3B initialized from Llama2-7B using code and math data; these models outperform base models on a variety of benchmarks while maintaining the original general capabilities.
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models.	demonstrates that a supervised fine-tuned LLM can be made better without needing to obtain more human-annotated data. Drawing inspiration from self-play, it uses the LLM to create its training data from prior iterations, then refines its policy by separating the responses it generated from the human-annotated data. This shows that the method can make the LLM perform better and outperform models trained via DPO with GPT-4 preference data.

News

Link	description
Microsoft’s Copilot app is now available on iOS.	The Microsoft Copilot app lets you ask questions, draft text, and generate images using AI.
Stuff we figured out about AI in 2023.	This piece aims to summarize the major advancements in AI research throughout the course of 2023. It addresses a number of topics, including LLM applications, the issue of gullibility, model tweaking, and how to execute LLMs on personal devices. When used appropriately, LLMs can significantly improve the quality of life for those who use them. Although they are really rather simple to construct, many applications still find them to be unstable and there is still plenty to learn about them.
Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models.	This study conducts a thorough evaluation of Gemini Pro's efficacy in commonsense reasoning tasks, employing a diverse array of datasets that span both language-based and multimodal scenarios.
Noise-free Optimization in Early Training Steps for Image Super-Resolution.	By concentrating on two crucial elements—the ideal centroid of possible high-resolution images and the intrinsic noise that degrades image quality—researchers have created a novel technique that enhances single image super-resolution.
AI-created “virtual influencers” are stealing business from humans.	Brands are turning to hyper-realistic, AI-generated influencers for promotions.
DeepMind AI outdoes human mathematicians on unsolved problem.	Large language model improves on efforts to solve combinatorics problems inspired by the card game Set.
Nikon, Sony, and Canon fight AI fakes with new camera tech.	Digital signatures to provide a way to tell real photos from deep fakes
Intel to spin out AI software firm with outside investment.	Intel on Wednesday said it was forming a new independent company around its artificial intelligence software efforts with backing from digital-focused asset manager DigitalBridge Group and other investors.
Search startup Perplexity AI valued at $520 mln in funding from Bezos, Nvidia.	Search startup Perplexity AI has raised $73.6 million from a group of investors including Nvidia
OpenAI’s app store for GPTs will launch next week.	OpenAI plans to launch a store for GPTs, custom apps based on its text-generating AI models (e.g. GPT-4), sometime in the coming week.
Google appears to be working on an ‘advanced’ version of Bard that you have to pay for.	Google might be on track to release a Gemini Ultra-powered Bard Advanced.
LLM Training and Inference with Intel Gaudi 2 AI Accelerators.	Excellent training throughput, flops, and decoding bandwidth are features of the new Intel processor, which is accessible for on-premise deployment across many platforms.
GitHub makes Copilot Chat generally available, letting devs ask questions about code.	GitHub’s launching Chat in general availability for all users.

Resources

Link	description
llm-course.	Course to get into Large Language Models (LLMs) with roadmaps and Colab notebooks.
Bash One-Liners for LLMs.	A project called Llamafile combines the inference and model code into a single portable executable. In order to handle command line output further, this blog post explains how to do so.
pykoi: RLHF/RLAIF in one unified interface.	pykoi is an open-source Python library for improving LLMs with RLHF. We provide a unified interface including RLHF/RLAIF data and feedback collection, finetuning with reinforcement learning and reward modeling, and LLM comparisons.
gpt-fast.	Simple and efficient pytorch-native transformer text generation.
TinyGPT-V.	TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
sbs-generator.	This repository contains a framework for converting monocular videos into side-by-side (SBS) 3D videos. It utilizes a combination of image processing techniques and depth map predictions to generate separate views for each eye, creating a 3D effect when viewed with appropriate hardware.
ColBERTv2: Indexing & Search Notebook.	ColBERT is a cutting-edge generation and retrieval technique. To assist readers in getting up to speed and experimenting with the technique, the authors have included a notepad.
intel-extension-for-transformers.	An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere
aMUSEd: An Open MUSE Reproduction.	We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation.
RAGatouille.	Easily use and train state-of-the-art retrieval methods in any RAG pipeline. Designed for modularity and ease of use, backed by research.
ODTrack.	ODTrack is a simple, flexible, and effective video-level tracking pipeline, which densely associates the contextual relationships of video frames in an online token propagation manner.
ARLib.	An open-source framework for conducting data poisoning attacks on recommendation systems, designed to assist researchers and practitioners.
Learning JAX as a PyTorch developer.	Some ideas about the transition from Pytorch to Jax. This post explains nine key ideas that set Jax apart and make it effective; each is illustrated with a lovely piece of code.
Mitigating Hallucination in LLMs.	A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
If LLM Is the Wizard, Then Code Is the Wand.	A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents

Perspectives

Link	description
How IBM Sees AI Changing the Game for Companies of All Sizes with IBM’s VP of Technology and Director of Startups.	AI technology is revolutionizing a variety of sectors' business landscapes. In this article, IBM's Director of Startups Kylie Rutherford, and Vice President of Software and Technology Raj Datta discuss how artificial intelligence (AI) is transforming business for organizations of all kinds and provide several use examples for different products.
LLMs and Programming in the first days of 2024.	Large Language Models (LLMs) have greatly accelerated code creation and comprehension of intricate APIs or frameworks in 2023, making them indispensable for programmers. LLMs perform well at high-level Python coding and routine chores, but they are less effective at sophisticated system programming. They may also be used as a simplified form of documentation and as an effective method for increasing productivity.
Surge in number of ‘extremely productive’ authors concerns scientists.	Some researchers publish a new paper every five days, on average. Data trackers suspect not all their manuscripts were produced through honest labor.
Satellite images reveal untracked human activity on the oceans.	Machine learning and satellite imagery have been used to map industrial infrastructure at sea — from fishing vessels to wind turbines. The findings provide a more comprehensive picture of maritime activity than ever before.
Revealing the ‘Clever Hans Effect’ in AI-Driven Drug Discovery.	In a landmark study at the University of Bonn, a team led by Prof. Dr. Jürgen Bajorath has revealed a significant finding about the role of artificial intelligence (AI) in pharmaceutical research.
What We Learned About AI and Education in 2023.	From Disruption to Integration: AI Responsive Education in 2023
The AI trust crisis.	Users worry that their data may be used to train OpenAI's models as a result of Dropbox's new AI features, even though Dropbox has denied this and has a policy requiring customer agreement for such usage. This circumstance draws attention to a larger crisis of confidence in AI and data privacy, highlighting the necessity of corporations communicating clearly and being open about how they use data.
The official OpenAI prompt engineering guide.	a thorough, step-by-step manual that outlines methods and techniques for improving performance with big language models such as GPT-4.

Back to index

2023

ML news: Week 18 - 24 December

Research

Link	description
Stabilizing Transformer Training by Preventing Attention Entropy Collapse.	Despite their incredible skills, transformers may be challenging to train because of their numerous instabilities. When the entropy of an Attention matrix collapses, it is one of the primary problems. With a straightforward reparametrization, our work offers a means to avoid it.
DiffusionLight: Light Probes for Free by Painting a Chrome Ball.	This effort overcomes the drawbacks of existing approaches that rely on HDR panorama datasets by introducing a unique method for predicting lighting in photos. The method uncovers a distinct link between chrome balls and diffusion noise by rendering chrome balls into conventional pictures using diffusion models.
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving.	A new framework called DriveMLM takes advantage of massive language models to improve autonomous driving. This system performs better in simulations and interacts with current autonomous driving systems. It does this by fusing linguistic judgments with vehicle controls.
Graph Neural Networks with Diverse Spectral Filtering.	A novel technique known as DSF has been created by researchers to enhance spectral graph neural networks. The World Wide Web and other complicated networks can be handled more effectively by DSF by adding node-specific filter weights.
Evaluating and Mitigating Discrimination in Language Model Decisions.	A proactive approach to assessing language models' potential for discrimination is covered in this article. The process involves coming up with a broad range of possible prompts for different decision scenarios and variations in demographic data. The main tactic for reducing both positive and negative discrimination is cautious prompt engineering.
Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models.	A comparison of UNet encoders and decoders in diffusion models demonstrates the former's more stable behavior. Thanks to this realization, a novel encoder propagation strategy was developed, greatly accelerating jobs like text-to-image and text-to-video production.
VideoPoet: A large language model for zero-shot video generation.	VideoPoet, a large language model (LLM) that is capable of a wide variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio.
Unmasking Deepfake Faces from Videos Using An Explainable Cost-Sensitive Deep Learning Approach.	a deep learning method for identifying deepfake faces in videos utilizing four pre-trained CNN models for high accuracy. official code.
Bi-directional Adapter for Multi-modal Tracking.	This effort solves the drawbacks of single-modal object tracking by introducing a multi-modal visual cue tracking paradigm that dynamically leverages the advantages of several modalities, such as RGB and infrared. official code.
Tokenize Anything via Prompting.	We present Tokenize Anything via Prompting, a unified and promptable model capable of simultaneously segmenting, recognizing, and captioning arbitrary regions, with flexible visual prompts (point, box, and sketch).
Gemini: A Family of Highly Capable Multimodal Models.	Gemini Paper: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding.
FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning.	Diffusion-based FontDiffuser is an automatic font production technique that works especially well with intricate characters and a wide range of style variants. It has a Style Contrastive Refinement module for style transfer and a Multi-scale Content Aggregation block for improved stroke preservation.
Splatter Image: Ultra-Fast Single-View 3D Reconstruction.	The Splatter Image is an ultra-fast method for single- and few-view 3D reconstruction. Training is done on 1 GPU, reconstruction at 38 FPS, and rendering at 588 FPS.
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU.	The hypothesis that models include cool neurons that are utilized significantly less frequently and hot neurons that are used for nearly all inputs is investigated in this research. It is possible to conserve memory without significantly reducing throughput by preloading the hot neurons to the GPU. There's a code library companion available.
On Inference Stability for Diffusion Models.	A 'sequence-aware' loss function has been created by researchers to enhance Denoising Probabilistic Models (DPMs) and solve the problem of timestep correlation in picture production. Better FID and Inception Scores demonstrate that this new method provides a tighter estimation of loss and significantly improves picture quality on datasets such as CelebA and CIFAR10.
CLIP-DINOiser: Teaching CLIP a few DINO tricks.	For better semantic segmentation without annotations, the novel CLIP-DINOiser approach combines self-supervised features with the zero-shot capabilities of the CLIP model.
A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models.	A novel technique called UDiffText improves the legibility of text in AI-generated graphics. Through the use of a sophisticated text encoder and extensive dataset fine-tuning, UDiffText enhances text correctness and dramatically lowers spelling errors.

News

Link	description
Snapchat+ subscribers can now create and send AI-generated images.	Snapchat is releasing a few new AI powered features for Snapchat+ subscribers.
Google brings Gemini Pro to Vertex AI.	After coming to Bard and the Pixel 8 Pro last week, Gemini, Google’s recently announced flagship GenAI model family, is launching for Google Cloud customers using Vertex AI.
Competitive performance claims and industry-leading Inference performance on AMD Instinct MI300X.	With ROCm 6, AMD's flagship AI accelerator, the MI300X, can now perform for inference tasks on par with NVIDIA. The community will benefit greatly from this as it gives burgeoning AI firms access to new chips.
Agility is using large language models to communicate with its humanoid robots.	the company is showcasing some of that work in a short video shared through its social channels.
Intel unveils new data center chip with focus on AI growth.	On Thursday, the company debuted its 5th Gen Intel Xeon processors during its AI Everywhere event
ByteDance is secretly using OpenAI’s tech to build a competitor.	‘They really just don’t want to get caught.’ The frenzied race to win in generative AI means that even the biggest players are cutting corners.
Atlassian welcomes AI to the team.	AI capabilities are now generally available across Jira Software, Confluence, Jira Service Management, and more.
Harvey raises $80M Series B.	A new round of investment has been secured by Harvey AI, a legal service based on OpenAI technology, valuing the startup business at over $700 million. Using OpenAI, the business creates foundation models for applications related to law and legal practice.
whiterabbitneo/WhiteRabbitNeo-13B.	The Matrix-related startup whiterabbitneo has produced a 13B parameter language model that has been trained for both offensive and defensive cyber security. It is qualified to respond to inquiries and offer details on computer security.
Anthropic Updates Terms of Service.	We are introducing new, simplified Commercial Terms of Service with an expanded copyright indemnity, as well as an improved developer experience with our beta Messages API.
Pilotless FedEx, Reliable Robotics Plane Completes Flight.	Milestone for Autonomous Flight Could Lead to Regulatory Approval
Microsoft Copilot gets a music creation feature via Suno integration.	Microsoft Copilot, Microsoft’s AI-powered chatbot, can now compose songs thanks to an integration with GenAI music app Suno.
Stability AI announces paid membership for commercial use of its models.	The company said paid tiers will fund the future of its AI research.
More than 10,000 research papers were retracted in 2023 — a new record.	The number of articles being retracted rose sharply this year. Integrity experts say that this is only the tip of the iceberg.
Large language models direct automated chemistry laboratory.	Automation of chemistry research has focused on developing robots to execute jobs. Artificial intelligence technology has now been used not only to control robots but also to plan their tasks on the basis of simple human prompts. research article.
Introducing Text-to-CAD.	After changing its name, Zoo Dev (formerly Kitty Cad) unveiled a new text-to-cad feature. This robust platform creates 3D assets that can be printed or used as parts.
Waymo finds that its driverless cars ‘significantly outperformed’ humans.	New safety research from Waymo finds that its driverless cars “led to a significant reduction in the rates of police-reported and injury-causing crashes compared to human drivers.”
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model.	A novel training-free framework called Diff-Text enables the creation of photo-realistic images with text in any language. Using sketched images as priors, it improves the multilingual capabilities of the Stable Diffusion model.
Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material.	The model is a massive part of the AI-ecosystem, used by Stable Diffusion and other major generative AI products. The removal follows discoveries made by Stanford researchers, who found thousands of instances of suspected child sexual abuse material in the dataset.
Rite Aid banned from using facial recognition software after falsely identifying shoplifters.	FTC says the company's 'reckless use' of AI humiliated customers
AI startup Anthropic reportedly in talks to raise $750M on a $15B valuation.	Anthropic PBC, an artificial intelligence startup backed by Amazon.com Inc. and Google LLC, is reportedly in talks to raise $750 million in new funding at a valuation of $15 billion
Apple’s latest AI research could completely transform your iPhone.	two new papers introducing new techniques for 3D avatars and efficient language model inference. The advancements could enable more immersive visual experiences and allow complex AI systems to run on consumer devices such as the iPhone and iPad.
OpenAI buffs safety team and gives board veto power on risky AI.	A new safety advisory group has been established by OpenAI, and the board has been given veto power over all models.

Resources

Link	description
Introducing Ego-Exo4D: A foundational dataset for research on video learning and multimodal perception.	Ego-Exo4D, a foundational dataset and benchmark suite to support research on video learning and multimodal perception.
Coffee.	Coffee, which was released last week, integrates AI into current codebases to speed up front-end development. The project primarily focuses on a first-class DX, drawing on insights the Coframe team has gained from producing more than 80% of their front-end using AI.
DeepEval.	DeepEval is a simple-to-use, open-source evaluation framework for LLM applications. It is similar to Pytest but specialized for unit testing LLM applications. DeepEval evaluates performance based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models locally on your machine.
Understanding GPU Memory 1: Visualizing All Allocations over Time.	Determining the reason behind memory leaks in Pytorch has proven to be one of the most difficult tasks for practitioners. Pytorch 2.1 has some incredible new features that provide insight into memory utilization. Classifying the utilization into well-known buckets (such as activations and gradients) is another application for it.
Fine Tuning Mistral 7B on Magic the Gathering Drafts.	Tips, examples, and thoughts from an exploration of the world of fine tuning
MMLU prompt templates.	The most effective prompting technique for MMLU at the moment is Microsoft's Medprompt+. The template, along with numerous other chain-of-thought style templates that are widely used in the evaluation community, was made available by Microsoft.
Amphion.	Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models or architectures.
Big Vision.	This codebase is designed for training large-scale vision models using Cloud TPU VMs or GPU machines. It is based on Jax/Flax libraries and uses tf. data and TensorFlow Datasets for scalable and reproducible input pipelines.
Thoughts on Jaxtyping.	In machine learning, shape problems are difficult to debug and can go undetected until the model is attempted to run. Checking shapes as a kind can help you get over most of these obstacles and advance faster.
Legaltech x AI: The Lightspeed View.	Lightspeed's perspective on the legal tech industry's use of AI is obvious and intriguing when viewed from the angle that what's good for VCs is good for everyone. Time will tell if they are on the right track or not, but there are some intriguing observations.
New Sequence Mixers.	The folks who brought us Mamba (and a ton of other models) have published a nice blog entry explaining simple sequence mixing topologies that provide some very significant speed increases over traditional Transformers.
microagents.	Agents Capable of Self-Editing Their Prompts / Python Code
LLMLingua.	LLMLingua utilizes a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens in prompts. This approach enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.
Distil-Whisper.	Distil-Whisper is a distilled version of Whisper that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution evaluation sets
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts.	M3DBench introduces a comprehensive 3D instruction-following dataset that encompasses a variety of 3D vision-centric tasks, spanning fundamental abilities in real-world 3D environments.
WhisperPlus: Advancing Speech-to-Text Processing.	Advanced speech-to-text processing
tinyzero.	Easily train AlphaZero-like agents in any environment you want!
MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation.	An improvement on the original MossFormer, the MossFormer2 model provides better monaural speech separation capabilities.
llama-recipes.	The 'llama-recipes' repository is a companion to the Llama 2 model. This repository aims to provide examples to quickly get started with fine-tuning for domain adaptation and how to run inference for the fine-tuned models.
LLaVA-Interactive.	LLaVA-Interactive is a large language-and-vision assistant demo, dedicated to demonstrating the possibilities of multimodal human-machine interaction: visual input, visual output, and visual interaction. It combines complementary skills from three models: visual chat of LLaVA, visual prompt for segmentation from SEEM, and visual prompt for image generation/editing from GLIGEN.
Whisper Turbo.	Whisper Turbo is a fast, cross-platform Whisper implementation, designed to run entirely client-side in your browser/electron app.

Perspectives

Link	description
The Where, When, and How of AI with Theory Ventures, Open AI, MotherDuck and Lamini.	Prominent inventors and venture capitalists discuss the latest developments in artificial intelligence, ranging from the use of LLMs to enterprise innovation. This is a helpful fast summary if the speed of "things you should know about AI" is a little overwhelming.
The Competition is Coming for Nvidia.	After a long, largely unimpeded run, NVIDIA’s challenge has finally arrived.
From Einstein to AI: how 100 years have shaped science.	Looking back a century reveals how much the research landscape has changed — and how unclear the consequences of scientific innovation can be.
ChatGPT and science: the AI system was a force in 2023 — for good and bad.	The poster child for generative AI software is a startling human mimic. It represents a potential new era in research but brings risks.
What was the Turing test actually about?.	It is important to develop metrics for the public scrutiny of today’s generative artificial intelligence.
Should scientists delegate their writing to ChatGPT?.	Scientists should exercise caution when using generative artificial intelligence (AI) tools such as ChatGPT to write grant applications
Mentor–trainee dialogue on proper use of AI tools.	The responsible use of artificial intelligence (AI) tools in education and academia is important on a micro- as well as a macro scale
My jaw hit the floor when I watched an AI master one of the world's toughest physical games in just six hours.	An AI just mastered Labyrinth in six hours, and I am questioning my own existence.
Meta’s CTO on how the generative AI craze has spurred the company to ‘change it up’.	Andrew Bosworth, Chief Technology Officer at Meta, discusses the company's future plans and the hype around artificial intelligence.
End of YearPay Report 2023.	Levels.fyi's annual compensation report. View top-paying companies, cities, titles & other trends.
Year One of Generative AI: Six Key Trends.	In this post, drawing on countless founder meetings and pitch decks, we distill our first-hand learnings into six trends that have defined the generative AI space throughout 2023 and that are set to shape its trajectory in 2024.
Marketplaces in the Age of AI.	For the past 20 years, marketplaces have dominated company models. This is a summary from a16z discussing their predictions on how AI would affect this kind of business. Customizing the experience for both sides of the marketplace is the main concept.

Back to index

ML news: Week 11 - 17 December

Research

Link	description
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models.	RAVE is a technique for video editing that improves videos by utilizing pre-existing text-to-image diffusion models. With this method, it is possible to achieve excellent video edits without sacrificing the original composition and flow.
Language-driven Scene Synthesis using Multi-conditional Diffusion Model.	Textual cues give scene creation—which is influenced by things like human movement or room design—a new perspective. This repository presents a new method that effectively combines text, movement, and pre-existing objects: a multi-conditional diffusion model.
Audiobox: Unified Audio Generation with Natural Language Prompts.	Meta has hinted at an AI model for audio foundations. The paper, along with more samples and powerful demos, has been made available. Producing controlled audio content with styles derived from the same model was the primary objective of the project.
BioCLIP: A Vision Foundation Model for the Tree of Life.	A vision model with applications in biology in mind. On certain physiological tasks, it performs about 20% better than OpenAI's clip. There is also a training set of 10 million paired images and text.
Diversifying Spatial-Temporal Perception for Video Domain Generalization.	A novel model called the Spatial-Temporal Diversification Network (STDN) examines relationships over time as well as spatial elements within frames to identify a range of cues in movies.
Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning.	FamO2O is a framework that researchers have developed to improve the performance of existing offline-to-online reinforcement learning algorithms by figuring out how to best balance limitations and improvement depending on the state.
Promising or Elusive? Unsupervised Object Segmentation from Real-world Single Images.	This study explores the challenge of utilizing unsupervised models to segment items in real-world photographs.
Phi-2: The surprising power of small language models.	Azure's Phi 2 is the latest in a line of small language models that were mostly trained on synthetic data. The performance of 13B parameter models is matched by the 2.7B parameter model. The difficult part of this task is identifying and addressing "test set rephrasing," although the model is very effective in any scenario.
DiAD: A Diffusion-based Framework for Multi-class Anomaly Detection.	DiAD uses diffusion models to its advantage in order to find abnormalities. To precisely identify and pinpoint abnormalities in multi-class environments, it integrates a pixel-space autoencoder, a Semantic-Guided (SG) network, and a feature-space extractor in a novel way.
Learning Naturally Aggregated Appearance for Efficient 3D Editing.	A new technique called AGAP makes 3D editing easier. AGAP enables users to effortlessly update 3D scenes without having to re-optimize for each change by utilizing a 2D image known as a canonical image.
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior.	A novel framework called Sherpa3D enhances the production of 3D content via text prompts. By employing coarse 3D information to direct the creation process, it combines the advantages of 2D and 3D diffusion models. Thus, the constraints of current technologies are overcome and high-quality, diversified, and geometrically consistent 3D assets are produced.
Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models.	This work presents a technique for reduced order modeling-based big language model compression that greatly minimizes memory and time requirements without requiring expensive hardware.
Towards a Generalized Multimodal Foundation Model.	With the help of FIND, AI models now have a flexible interface that improves their comprehension of datasets and visuals without changing the fundamental model.
HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts via HyperNetwork.	The HyperRouter technique dynamically modifies router characteristics to increase the effectiveness of training big language models.
FunSearch: Making new discoveries in mathematical sciences using Large Language Models.	By searching for “functions” written in computer code, FunSearch made the first discoveries in open problems in mathematical sciences using LLMs. scientific article.
Weak to Strong Generalization.	An analog to weak people aligning super intelligent models, this new discovery from the OpenAI super alignment team (with code) implies that you may utilize much weaker supervisor models to guide or align a considerably more powerful model. To regain much of the alignment performance of GPT-4, they employed GPT-2. Among the key differences between this approach and RLHF-like approaches is that the former provides a tractable route for significant improvement.
Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers.	Object identifiers are included in Large Language Models by a novel research technique aimed at enhancing comprehension and providing answers about 3D situations. This approach, which focuses on finding and connecting items in a scene, has demonstrated encouraging outcomes in terms of improving AI's comprehension of intricate spatial connections.
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention.	SwitchHead is a breakthrough in improving the effectiveness of AI models. Transformers' memory and processing requirements are decreased without sacrificing functionality.
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models.	offers a method for self-training with feedback that can significantly lessen reliance on data created by humans; when model-generated data is used in conjunction with a reward function, LLM performance on problem-solving tasks is enhanced.

News

Link	description
French AI start-up Mistral secures €2bn valuation.	Eight-month-old group set to close roughly €400mn funding round as early as Friday, in new deal lead by Andreessen Horowitz. Meanwhile, it unveils its platform with new models, an embedding, and instruction-tuned models.
Google unveils AlphaCode 2, powered by Gemini.	Alongside its Gemini generative AI model, Google this morning took the wraps off of AlphaCode 2, an improved version of the code-generating AlphaCode
Introducing Stable LM Zephyr 3B.	Stable LM Zephyr 3B is a 3 billion parameter Large Language Model (LLM), 60% smaller than 7B models, allowing accurate, and responsive output on a variety of devices without requiring high-end hardware.
Better, Cheaper, Faster LLM Alignment with KTO.	a method called Kahneman-Tversky Optimization (KTO) that makes it easier and cheaper than ever before to align LLMs on your data without compromising performance.
Paving the way to efficient architectures: StripedHyena-7B.	StripedHyena builds on the many lessons learned in the past year on designing efficient sequence modeling architectures: H3, Hyena, HyenaDNA, and Monarch Mixer.
FTC looking into Microsoft's investment in OpenAI: Report.	Following UK regulators' probe, the Federal Trade Commission (FTC) is inspecting Microsoft's (MSFT) relationship with and investment in artificial intelligence developer OpenAI, according to Bloomberg.
Liquid AI, a new MIT spinoff, wants to build an entirely new type of AI.	An MIT spinoff co-founded by robotics luminary Daniela Rus aims to build general-purpose AI systems powered by a relatively new type of AI model called a liquid neural network.
Microsoft and Labor Unions Form ‘Historic’ Alliance on AI.	Microsoft Corp. is teaming up with labor unions to create “an open dialogue” on how artificial intelligence will impact workers.
Europe reaches a deal on the world’s first comprehensive AI rules.	European Union negotiators clinched a deal Friday on the world’s first comprehensive artificial intelligence rules, paving the way for legal oversight of AI technology that has promised to transform everyday life and spurred warnings of existential dangers to humanity.
OpenAI leaders warned of abusive behavior before Sam Altman’s ouster.	Sam Altman was briefly fired by OpenAI after a group of top leaders expressed concerns to the board about his claimed psychological abuse, which included inciting conflict amongst staff and causing turmoil. There were also allegations of dishonesty in Altman's board discussions. Threats of widespread resignations and resounding staff support led to Altman's restoration; yet, the incident has raised doubts about the company's future course.
MIT group releases white papers on governance of AI.	The series aims to help policymakers create better oversight of AI in society.
Google weighs Gemini AI project to tell people their life story using phone data, photos.	“Project Ellmann” is an internal Google proposal to use artificial intelligence to help users get a “bird’s-eye view” of their life stories. The idea would be to use LLMs like Gemini to ingest search results, spot patterns in a user’s photos, create a chatbot and “answer previously impossible questions” about a person’s life. The team also demonstrated “Ellmann Chat,” with the description “Imagine opening ChatGPT but it already knows everything about your life.”
a16z Open Source AI Grant recipients.	This program is designed to support a thriving open-source ecosystem around modern AI. We provide grant funding (not an investment) to developers and small teams who are building critical pieces of the open-source AI stack.
Google debuts Imagen 2 with text and logo generation.	Google’s making the second generation of Imagen, its AI model that can create and edit images given a text prompt, more widely available — at least to Google Cloud customers using Vertex AI who’ve been approved for access.
First Impressions with Google’s Gemini.	The Roboflow team has analyzed Gemini across a range of standard prompts that we have used to evaluate other LMMs, including GPT-4 with Vision, LLaVA, and CogVLM. Our goal is to better understand what Gemini can and cannot do well at the time of writing this piece.
OpenAI inks deal with Axel Springer on licensing news for model training.	OpenAI today announced that it’s reached an agreement with Axel Springer, the Berlin-based owner of publications including Business Insider and Politico, to train its generative AI models on the publisher’s content and add recent Axel Springer-published articles to OpenAI’s viral AI-powered chatbot ChatGPT.
OpenAI's Superalignment Fast Grants.	We’re launching $10M in grants to support technical research toward the alignment and safety of superhuman AI systems, including weak-to-strong generalization, interpretability, scalable oversight, and more.
A new old kind of R&D lab.	Respond AI is a new lab that seeks to find genuinely useful and productive applications for current models rather than creating new ones. Its goal is to do basic research to assist enterprises in enabling AI-enabled application cases.
Samsung unveils its generative AI model Samsung Gauss.	Samsung Gauss consists of language, code, and image models and will be applied to the company's various products in the future.

Resources

Link	description
Towards 100x Speedup: Full Stack Transformer Inference Optimization.	Deployment optimizations are getting more and more popular as open models prove beneficial for various enterprise needs. But the terrain is uneven and complicated. This article provides a good in-depth analysis of numerous common methods for accelerating language model serving.
Pearl - A Production-ready Reinforcement Learning AI Agent Library.	Pearl is a new production-ready Reinforcement Learning AI agent library open-sourced by the Applied Reinforcement Learning team at Meta.
Running Dolphin Locally with Ollama.	For llama cpp models, ollama functions similarly to package management. Its characteristics for quality-of-life usability make modeling simple, even while using a CPU. The two fantastic models, Samantha and Dolphin, which are good unfiltered models helpful for conversational activities, are demonstrated in this example.
LLM Visualization.	Welcome to the walkthrough of the GPT large language model! Here we'll explore the model nano-gpt, with a mere 85,000 parameters.
giskard.	The testing framework dedicated to ML models, from tabular to LLMs
BricksLLM: AI Gateway For Putting LLM In Production.	BricksLLM is a cloud-native AI gateway written in Go. Currently, it serves as a proxy for OpenAI. We let you create API keys that have rate limits, cost limits, and TTLs.
KwaiAgents.	KwaiAgents is a series of Agent-related works open-sourced by the KwaiKEG from Kuaishou Technology
Now add a walrus: Prompt engineering in DALL-E 3.	An experiment using DALL-E 3 that shows how various prompts produce a range of visuals and how additional prompts hone these images.
HuggingFace gets AMD support.	The new Mistral model, AMD compatibility, safe sensors by default, and more are included in Transformers 4.36.0!
AI Tamago.	A 100% local, LLM-generated, and driven virtual pet with thoughts, feelings, and feedback. Revive your fond memories of Tamagotchi!
Introducing gigaGPT: GPT-3 sized models in 565 lines of code.	GigaGPT is Cerebras’ implementation of Andrei Karpathy’s nanoGPT – the simplest and most compact code base to train and fine-tune GPT models.
Awesome CLIP in Medical Imaging.	This is a collection of awesome articles about CLIP in medical imaging
Roadmap To Learn Generative AI In 2024.	A repository with links and videos
Lookup Table meets Local Laplacian Filter: Pyramid Reconstruction Network for Tone Mapping.	LLF-LUT is an effective end-to-end framework for the HDR image tone mapping task performing global tone manipulation while preserving local edge details.
Obsidian: Worlds smallest multi-modal LLM. First multi-modal model in size 3B.	Obsidian-3B-V0.5 is a multi-modal AI model that has vision! Its smarts are built on Capybara-3B-V1.9 based on StableLM-3B-4e1t. Capybara-3B-V1.9 achieves state-of-the-art performance when compared to models of similar size, even beating some 7B models.
Mathematical Language Models: A Survey.	a survey on LLMs' development on mathematical activities; includes papers and resources on LLM research on tasks including theorem proving and math word problem solving, as well as prompting approaches.
A Survey of Large Language Models in Medicine: Principles, Applications, and Challenges.	a thorough analysis of more than 300 papers on LLMs in medicine; provides a synopsis of the concepts, uses, and difficulties faced by LLMs in the field.

Perspectives

Link	description
Excuse me, but the industries AI is disrupting are not lucrative.	Although the demo video for Google's Gemini AI model was remarkable, the model's real-time capabilities were questioned due to the use of pre-recorded material and edited responses, which resulted in a modest gain in Google's stock price. This caution is reflective of larger worries in the AI sector, as businesses set high standards but struggle to convert AI skills into meaningful economic gains. Current AI models are particularly good in domains that don't always generate large profits.
Interoperable Authentication Protocol.	Considering how quickly model capabilities are developing, coordinating communication between language models and users is essential. In order to address this, the Interoperable Authorization Protocol (IAP) establishes a consent management system and secure, flexible communication routes. In order to match AI operations with a variety of human values and objectives, this open-source methodology promotes cooperation within the AI community.
On Platform Shifts and AI.	While AI represents a technological shift, the discussion at the 2022 TCV Engage Summit focused on the fact that it requires a new distribution channel, which is critical for generating meaningful consumer prospects. Presently available AI breakthroughs are dependent on conventional channels of distribution, which benefits well-established businesses or creative start-ups; nevertheless, new channels of distribution may not materialize as expected.
The AI revolution in chemistry is not that far away.	Although the artificial intelligence (AI) revolution in chemistry has yet to happen, it is not that far off. The key question is what we can do to get there faster.
Can AI deliver advice that is judgment-free for science policy?	We acknowledge the potential of using artificial intelligence (AI) to inform science policy, but disagree with the suggestion that it can create judgment-free policy advice
How to make data open? Stop overlooking librarians.	Digital archivists are already experts at tackling the complex challenges of making research data open and accessible. We can help to smooth the transition.
Vertical AI.	The previous ten years of SaaS have been characterized by horizontal software, or software created to alleviate a user's pain regardless of the kind of user experiencing it. In contrast, vertical SaaS is intended for a specific user base, making it possible to customize the solution to meet their demands. According to a Greylock partner's prediction in this article, going the future, software development will be guided by consumers' expectations for customized services.
Why the AI Act was so hard to pass.	Over two years since it was first proposed, policymakers in Brussels were still debating core contents of the EU’s landmark AI regulations hours before reaching a deal.
Google’s Gemini Marketing Trick.	The world saw a jaw-dropping demo of Gemini this week. It just wasn’t the real deal.
Why Stability AI is launching a subscription fee.	Stability AI will charge commercial customers for the use of its most advanced models, pivoting away from being fully open source
The real research behind the wild rumors about OpenAI’s Q* project.	OpenAI hasn't said what Q* is but it has revealed plenty of clues.
Two Titans on the Future of AI (with Reid Hoffman & Vinod Khosla).	A double header from Cerebral Valley.

Back to index

ML news: Week 4 - 10 December

Research

Link	description
Diffusion Models Without Attention.	Modern diffusion models employ the attention mechanism in most cases, but not always. Recent advances in theory have accelerated the proliferation of interest in state spaces, leading to intriguing new applications.
MoMask: Generative Masked Modeling of 3D Human Motions.	A new initiative by the authors of seminal work in this field combines innovative encoder techniques to provide fine-grained control over the production of the final animation.
When StyleGAN Meets Stable Diffusion.	Enhancing identity preservation in produced pictures, a novel technique uses the enlarged StyleGAN embedding space W+ for text-to-image diffusion models.
MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation.	MaXTron is a simple yet effective unified meta-architecture for video segmentation, which enriches existing clip-level segmenters by introducing a within-clip tracking module and a cross-clip tracking module, thus achieving better temporally consistent segmentation results.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces.	An additional article on state spaces that offers better performance and scalability. Here, they train a 3B parameter model, which beats out bigger 7B parameter Transformer models, by taking inspiration from the LSTM. official code.
MotionEditor: Editing Video Motion via Content-Aware Diffusion.	MotionEditor is a diffusion model that expertly strikes a balance between preserving the original material and manipulating motion in videos. With the introduction of a novel two-branch architecture with attention injection and a content-aware motion adaptor, modified movements may be seamlessly integrated while preserving the protagonist's look and the original background. official code.
Exploiting Diffusion Prior for Generalizable Pixel-Level Semantic Prediction.	Artificial intelligence-generated images now have more accurate semantic predictions because of a new technique called Diffusion Models as Prior (DMP). Even with minimal training data, this novel method outperforms existing approaches by shrewdly adapting pre-trained text-to-image models for a variety of tasks, such as semantic segmentation and 3D property estimation. official code.
IMMA: Immunizing text-to-image Models against Malicious Adaptation.	A novel way to prevent malicious adaptations of text-to-image models to produce harmful content is provided by the new IMMA approach. official code.
Language model self-teaching for domain adaptation.	You may either employ some retrieval strategies or fine-tune language models when attempting to apply them in areas that need knowledge of certain niches. Each has shortcomings. This innovative approach makes use of artificial data that is self-generated to improve knowledge throughout testing. In comparison to both fine-tuning and RAG, it demonstrates notable gains on typical adaption standards.
DiffiT: Diffusion Vision Transformers for Image Generation.	The Diffusion Vision Transformers (DiffiT) is a project that investigates the efficacy of vision transformers in diffusion-based generative learning. This model combines a new time-dependent self-attention module with a U-shaped encoder-decoder design.
Zero123++: A Single Image to Consistent Multi-view Diffusion Base Model.	This project introduces Zero123++, a model that applies diffusion concepts to produce consistent multi-view pictures from a single input image. Zero123++ tackles problems like as alignment and texture quality by using pre-trained 2D models.
Salient Object Detection in RGB-D Videos (RDVS dataset and DCTNet+ model).	The RDVS dataset, which contains a wide variety of RGB-D video sequences, and DCTNet+, a specialized network for RGB-D video object detection, which is outfitted with cutting-edge features for accurate prediction and enhanced performance over previous models, are the two main contributions revealed by this repository.
Style Aligned Image Generation via Shared Attention.	Amazing work by Google based on SDXL that distributes focus between generations to preserve unified looks. Importantly, this approach doesn't need to be adjusted.
Describing Differences in Image Sets with Natural Language.	This essay compares and contrasts two image collections using natural language. This is a brand-new, difficult challenge. The individual photos are captioned, rearranged, and summarized using a language model as part of the solution. official code.
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models.	In this work, BenchLMM, a benchmark for testing the resilience of large multimodal models such as GPT-4V and LLaVA against different image styles, is presented.
Let's Think Outside the Box !	This paper presents an approach to investigate Leap-of-Thought capabilities in multimodal LLMs using the Oogiri comedy-generating game. This method forces LLMs to use non-sequential thinking, which is an essential ability for coming up with original and amusing answers to a variety of multimodal material.
OneLLM: One Framework to Align All Modalities with Language.	OneLLM employs a universal encoder and a universal projection module to align multimodal inputs with LLM. It also utilizes modality tokens {modal} to switch between modalities.
Kandinsky 3.0.	We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. Compared to previous versions of Kandinsky 2.x, Kandinsky 3.0 leverages a two times larger UNet backbone, and a ten times larger text encoder and removes diffusion mapping. official code.

News

Link	description
OpenAI’s GPT store delayed to next year.	OpenAI’s GPT store will be delayed until next year, the company said in an email to people who signed up for its GPT Builder.
Amazon's Q generative AI chatbot allegedly leaks location of AWS data centers - report.	Amazon's newly launched artificial intelligence chatbot Amazon Q is “experiencing severe hallucinations and leaking confidential data,” internal documents warn.
Interview: Sam Altman on being fired and rehired by OpenAI.	“I totally get why people want an answer right now. But I also think it’s totally unreasonable to expect it.” In the meanwhile, OpenAI Agreed to Buy $51 Million of AI Chips From a Startup Backed by CEO Sam Altman and other details on the story emerge: New report illuminates why OpenAI board said Altman “was not consistently candid”.
Report: Stability AI Positioning Itself for Acquisition.	Stability AI, a British artificial intelligence (AI) startup, is reportedly considering selling the company amidst mounting pressure from investors over its financial position.
Report: Google delays Gemini launch from next week to January.	Google announced Gemini at I/O 2023 as its next-generation foundation model. According to a report today, Google was originally going to launch Gemini next week, but that has now been delayed until January.
The GPT to rule them all: Training for one trillion parameter model backed by Intel and US government has just begun.	LLM playfully dubbed 'ScienceGPT' is being trained from data from the Aurora supercomputer
Perplexity AI unveils ‘online’ LLMs that could dethrone Google Search.	Confusing AI can overthrow Google with its blend of current knowledge, a conversational AI chatbot interface, and a web index. Versions of the open-source models from Mistral and Meta that have been improved and enhanced have been made available by the firm. The models are meant to provide useful, accurate, and current information. These are the first-ever live LLM APIs with no knowledge cutoff, and they are based on online search data.
AI Alliance Launches as an International Community for Safe and Open AI .	More than 50 international organizations come together under the leadership of IBM and Meta to promote transparent, ethical AI development. Its main objectives are to advance hardware, establish AI standards, and advance AI knowledge and expertise. Major IT companies, academic organizations, and research centers are among the members. The Alliance places a strong emphasis on diversity, safety, and fair access to AI innovation.
Elon Musk's AI firm xAI files to raise up to $1 billion in equity offering.	Elon Musk's artificial intelligence startup xAI has filed with the U.S. securities regulator to raise up to $1 billion in an equity offering, according to a filing on Tuesday.
Google’s new AI experiment composes abstract musical clips inspired by instruments.	You may not hear the exact sound of the instrument you entered. you can try here.
Solve Intelligence helps attorneys draft patents for IP analysis and generation.	A platform that is AI-native and helps quickly create excellent patents is called Solve Intelligence. More than 25 IP firms worldwide have been employing them since their launch in July, and customers have reported 60–90% increases in efficiency. After completing Y Combinator, the startup just revealed the details of its $3 million seed financing.
Airbnb has acquired GamePlanner.AI.	Airbnb has purchased GamePlanner, an AI startup.AI, a business headed by Siamak Hodjat, an AI specialist, and Adam Cheyer, a co-founder of Siri. The group's main goals will be to integrate their technologies with Airbnb's platform and expedite certain AI projects. GamePlanner and Airbnb.AI are dedicated to leveraging AI to improve human-computer interaction.
Introducing Gemini: our largest and most capable AI model.	The field-wise network is a column-specific neural network to capture the information associated with the feature. The second component, on the other hand, is across-field network to choose the specialized operations for the dataset. Finally, the last step in the operation fusion network nonlinearly connects the chosen operations. In other words, it is an enhancement of DeepFM in which the network selects the best operations. Not all was good: Google’s best Gemini demo was faked.
Meta's AI image generator is available as a standalone website.	The company is testing dozens of new AI features in Facebook, Instagram and WhatsApp.
Microsoft’s Copilot is getting OpenAI’s latest models and a new code interpreter.	GPT-4 Turbo is on the way soon, alongside improvements to the DALL-E 3 model and deep search results for Bing.
Elon Musk told OpenAI to move faster right before he left the company in 2018: NYT.	Elon Musk said OpenAI needed to be quicker with its work before departing from the company, per NYT.
Purple Llama: Towards open trust and safety in the new world of generative AI.	Aiming to provide fair and equitable conditions for creating responsible and safe generative AI experiences, Purple Llama is a new initiative that is releasing models, assessments, and tools under permissive licenses for both commercial and research use. The Llama Guard model for identifying and thwarting cyberattacks, CyberSecEval for evaluating AI system security, and tools for insecure code detection and cyberattack compliance testing are among the initial products. By democratizing access to necessary resources, this project enables developers to design safe and ethical AI experiences.
X begins rolling out Grok, its ‘rebellious’ chatbot, to subscribers.	Grok, a ChatGPT competitor developed by xAI, Elon Musk’s AI startup, has officially launched on X, the site formerly known as Twitter.
Enabling next-generation AI workloads: Announcing TPU v5p and AI Hypercomputer.	Google unveiled Cloud TPU v5p, the company's most potent, adaptable, and scalable AI accelerator to yet. AI-powered products are trained and served using TPUs. Additionally, Google has revealed the AI Hypercomputer from Google Cloud, a supercomputer architecture that makes use of an integrated system of top-tier machine learning frameworks, open software, performance-optimized hardware, and adaptable consumption options. Systems-level co-design is used by AI Hypercomputer to increase productivity and efficiency in AI serving, tweaking, and training.
Long context prompting for Claude 2.1.	In reference, Anthropic's last release of Claude had 200k tokens. It only demonstrated 27% retrieval performance on certain common tasks, suggesting that it was afflicted by the "lost in the middle" issue that plagues language models in external evaluations. Retrieval performance rises to 98% when the prompt is modified to include the statement "Assistant: Here is the most relevant sentence in the context."

Resources

Link	description
weight-selection.	We introduce weight selection, a method for initializing models by selecting a subset of weights from a pre-trained larger model. With no extra cost, it is effective for improving the accuracy of a smaller model and reducing its training time needed to reach a certain accuracy level.
Nous-Hermes-2-Vision - Mistral 7B.	This vision model is a potent new open-source text and vision model that can operate on consumer hardware, and it is built on top of the best 7B language model with SigLIP integration. The incorporation of function calling is one of the neat ideas here. Due to a hallucination problem, the model remains in alpha.
LLM As A Function.	It's helpful to think about language models as functions with standard input and output when adding them to your code base. The author of React Native demonstrates a couple of methods for doing that in this blog post, along with the advantages of modeling your models in this manner.
aiconfig.	AIConfig saves prompts, models, and model parameters as source control-friendly configs. This allows you to iterate on prompts and model parameters separately from your application code.
Microsoft's Generative AI for Beginners.	A 12 Lesson course teaching everything you need to know to start building Generative AI applications
Unsloth.	Fast and memory-efficient LLM tuning.
Responsible Innovation Labs Launches First Pro-Innovation Responsible AI Protocol For Startups.	A voluntary industry-driven approach for responsible AI has been introduced by nonprofit Responsible Innovation Labs (RIL), especially for startups and their investors. The protocol is intended to make sure startups incorporate ethical AI practices as they grow, and it has the backing of 35 top venture capital funds. This action is in line with the Biden Administration's ethical AI goal.
mlx.	Thanks to unified memory, Apple has discreetly introduced a new Array framework that runs faster on Mac devices. It offers some GPU support in addition to being straightforward and tidy.
stable-fast.	An ultra-lightweight inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
Introducing the OpenAI Switch Kit: Move from closed to open-source AI in minutes.	With just a few lines of code, you can make your project open source.
Optimizing LLMs for Real-World Applications.	Lightspeed provides information from TitanML and Google regarding the specifics of enhancing LLMs through fine-tuning or prompting.
Efficient SAM Example.	This script provides an example of how to get visualization results from EfficientSAM using ready-to-use torch script
CopilotKit.	Integrate AI-powered Textareas and in-app chatbots into React web applications.
Mamba-Chat.	Mamba-Chat is the first chat language model based on a state-space model architecture, not a transformer. The model is based on Albert Gu's and Tri Dao's work Mamba: Linear-Time Sequence Modeling with Selective State Spaces
mlx-llama.	People have already enabled Llama 2 models to operate on the new framework, just one day after Apple published the MLX framework. or if you prefer Mistral.
UIDraw.	Draw and build a website on your phone.
SEO GPT by Writesonic.	Boost your website’s SEO instantly right inside ChatGPT.

Perspectives

Link	description
The future of AI in software development.	Inbal Shani, the Chief Product Officer of GitHub, talks about the role AI plays in software development and makes the case that AI-driven code production will increase developer productivity rather than replace it. She delves into GitHub's Copilot's performance measures, philosophy, and innovation-promoting practices. The future of AI in the IT sector is clarified by this discussion.
OpenAI & Grand Strategy.	The importance of grand strategy in the IT industry is emphasized in this essay, which compares the lofty goals of contemporary tech executives to historical victories and exhorts them to behave and think like historical leaders in order to match capabilities with ambitions. It uses the recent happenings at OpenAI as an illustration of a grand strategy that works and challenges prospective leaders to develop plans that match their skills with their goals for meaningful, constructive change.
AI and Trust.	In the discussion, the speaker emphasizes the differences between social and interpersonal trust and warns about the potential for profit-driven organizations to take advantage of our inclination to view AI as friends rather than as a service. In order to guarantee that AI continues to be a reliable and advantageous service for society, they urge government intervention through transparency legislation and regulations targeted at the people who create AI. They also ask for the creation of public AI models.
AI Doomers are worse than wrong - they're incompetent.	This essay attacks the OpenAI AI doomer movement for making calculated mistakes and unnecessarily speeding up AI development instead of protecting it.
How AI Changes Workflows.	GitHub has acknowledged the impact of AI on developer workflows and is "re-founding" Copilot. With AI-assisted procedures like autocomplete code, it will boost productivity and might potentially completely reorganize operations. This change permits customized workflows, but it also necessitates striking a compromise between flexibility and the capacity to offer extensive client assistance.
Teach Llamas to Talk: Recent Progress in Instruction Tuning.	After instruction tweaking was implemented, the utility of language models significantly increased. Synthetic data pipelines are one of the numerous recent innovations that improve and streamline the process.
Two Titans on the Future of AI (with Reid Hoffman & Vinod Khosla).	An excellent synopsis of Reid Hoffman and Vinod Khosla's 45+ minute lectures, in which they each delved into a range of subjects from AI to manifestos for "techno-optimists." Both titans have insightful opinions about the future and the best ways to manage the tech sector.
Is AI leading to a reproducibility crisis in science?	Scientists worry that ill-informed use of artificial intelligence is driving a deluge of unreliable or useless research.
Generative AI could revolutionize health care — but not if control is ceded to big tech.	Large language models such as that used by ChatGPT could soon become essential tools for diagnosing and treating patients. To protect people’s privacy and safety, medical professionals, not commercial interests, must drive their development and deployment.
ChatGPT one year on: who is using it, how and why?.	In just a year, ChatGPT has permeated scientific research. Seven scientists reveal what they have learned about how the chatbot should — and shouldn’t — be used.
ML system design: 300 case studies to learn from.	How do companies like Netflix, Airbnb, and Doordash apply machine learning to improve their products and processes? We put together a database of 300 case studies from 80+ companies that share practical ML use cases and learnings from designing ML systems.

Back to index

ML news: Week 27 November - 3 December

Research

Link	description
SegVol: Universal and Interactive Volumetric Medical Image Segmentation.	Clinical analysis has entered a new era with the release of SegVol, a universal model for medical picture segmentation. SegVol is highly proficient at segmenting a wide range of anatomical categories, having been trained on a large set of CT images. official code.
Visual In-Context Prompting.	This novel strategy supports a variety of cues and environments, significantly improving performance in segmentation tasks and demonstrating outstanding outcomes in open-ended challenges. official code.
Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF.	Researchers at Berkeley used synthetic preference data to train a brand-new, cutting-edge 7B parameter model. This blog discusses the unique difficulties in training reward models (such as how an example's score might change depending on where it is in the list) and how they overcome them. Both the training reward model and the generated model are made public.
Segmentation-Based Parametric Painting.	A novel method has been devised by researchers to convert pictures into paintings that emulate human characteristics and aesthetics.
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers.	A groundbreaking study explores the potential of shallow feed-forward networks to replace attention mechanisms in Transformer models. Shallow feed-forward networks can emulate the behavior of attention mechanisms effectively, with similar performances. This research opens new avenues in neural network design, potentially simplifying complex models.
MEDITRON-70B: Scaling Medical Pretraining for Large Language Models.	Large language models (LLMs) can potentially democratize access to medical knowledge. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain.
DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization.	A novel technique for maintaining linguistic content in sign language films while maintaining anonymity is DiffSLVA. By eliminating the need for exact position prediction, this method, which makes use of pre-trained diffusion models and a dedicated module for face expressions, addresses earlier shortcomings.
Efficient Dataset Distillation via Minimax Diffusion.	Generative diffusion techniques have been used in a novel way to produce surrogate datasets that are much less computationally intensive and far more representative and varied.
Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models.	A new method for video super-resolution (VSR) called StableVSR uses a temporal conditioning module together with diffusion models to improve the quality of upscaled movies.
Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling.	This research suggests "Animatable Gaussians," a cutting-edge method that blends 3D Gaussian splatting with 2D CNNs to produce more realistic and intricate human avatars from films.
Illuminating protein space with a programmable generative model.	Evolution has produced a range of diverse proteins, and now a generative model called Chroma can expand that set by allowing the user to design new proteins and protein complexes with desired properties and functions.
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection.	UVCOM is a new framework that better handles the distinct requirements of Video Moment Retrieval (MR) and Highlight Detection (HD).
DeepSeek-LLM.	The Deepseek coder model demonstrated remarkable code synthesis capability. Its current chat LLM, with 67B parameters, works noticeably better than Llama 2's 70B.
llamafile.	llamafile lets you distribute and run LLMs with a single file. Introducing llamafile.
Millions of new materials discovered with deep learning.	AI tool GNoME finds 2.2 million new crystals, including 380,000 stable materials that could power future technologies
Seamless: Multilingual Expressive and Streaming Speech Translation.	An innovative technique that transforms automated voice recognition is called Seamless. Conversations feel more natural thanks to this sophisticated model, which not only translates across 76 languages but also maintains the speaker's distinct prosody and voice style.
Language-conditioned Detection Transformer.	A novel open-vocabulary detection system called DECOLA is presented by researchers. It performs exceptionally well at recognizing things outside of its training dataset. official code.
Diffusion-MU-Attack.	This project proposes an evaluation system to assess the reliability of safety-driven unlearning techniques in diffusion models through the use of adversarial prompts.
AI system self-organises to develop features of brains of complex organisms.	Cambridge scientists have shown that placing physical constraints on an artificially-intelligent system – in much the same way that the human brain has to develop and operate within physical and biological constraints – allows it to develop features of the brains of complex organisms in order to solve tasks.

News

Link	description
Anthropic slashes AI pricing amid rising competition.	For tokens created after the most recent version of Claude was released, Anthropic added a pricing reduction. Both closed and open model pressure are the source of this. In addition, the model now can digest up to 200k tokens, hallucinates half as often, and can search the web, more info here.
Gen AI for the Genome: LLM Predicts Characteristics of COVID Variants.	A new demo lets users explore visualizations of the genome-scale language model by Argonne National Laboratory, NVIDIA, and other collaborators.
Codegen raises new cash to automate software engineering tasks.	Codegen has successfully raised a significant amount of money for some truly incredible automated software development technology. It links GitHub PRs for automated engineering to Jira boards.
Kyutai is a French AI research lab with a $330 million budget that will make everything open source.	This new lab, called Kyutai, will be a privately funded nonprofit working on artificial general intelligence. It will work with PhD students, postdocs, and researchers on research papers and open-source projects.
Amazon and Salesforce Expand Partnership to Add New AI Capabilities.	Salesforce and AWS have extended their collaboration to facilitate customers' management of data on both platforms and their integration of generative AI technology into their workflows and apps.
Tesla starts releasing to employees FSD v12 – a critical update to self-driving effort.	Tesla has started releasing to employees its FSD v12 update, which is apparently critical to Tesla’s achieving its self-driving goal. The biggest difference with the update is how vehicle controls will be taken over by neural nets rather than being hard-coded by engineers.
ChatGPT with voice is now available to all free users.	Download the app on your phone and tap the headphones icon to start a conversation.
Announcing the MLCommons AlgoPerf Training Algorithms Benchmark Competition.	A recent competition called AlgoPerf attempts to optimize for wall clock time. This implies that you can earn real money if you can develop a technique that outperforms current settings (faster than ADAM, for example). Some of the biggest AI companies in the world today support this fascinating task.
Introducing SDXL Turbo: A Real-Time Text-to-Image Generation Model.	SDXL Turbo achieves state-of-the-art performance with a new distillation technology, enabling single-step image generation with unprecedented quality, reducing the required step count from 50 to just one. code and weights.
OpenAI unlikely to offer board seat to Microsoft, other investors - source.	ChatGPT owner OpenAI is not expected to offer Microsoft and other investors including Khosla Ventures and Thrive Capital seats on its new board, a person familiar with the matter told Reuters on Tuesday. OpenAI has a new initial board. OpenAI seems to have returned to normality, however Some OpenAI employees are still feeling uneasy and looking for other jobs despite Sam Altman's return, report says.
Sports Illustrated Published Articles by Fake, AI-Generated Writers.	Articles created by fictitious AI authors have been covertly published by Sports Illustrated.
Together AI raises $102M series A.	This investment, led by Kleiner Perkins, will support the expansion of the training and inference solutions, which have had tremendous uptake since Together's June debut.
Amazon Q.	A generative model was trained by Amazon to help AWS platform users. Additionally, it will be used to general business support. The model is a proprietary mechanism that responds to inquiries from users on different facets of Amazon's backend infrastructure.
Voltage Park launches massive new cloud for AI development.	With a 24k H100 mega cluster, Voltage Park is a new cloud service that powers processes for clients like Character AI. With less than $2 per hour for each GPU, it appears to have an industry-leading price.
These ex-Apple employees are bringing AI to the desktop.	After selling Workflow to Apple in 2017, the co-founders are back with a new startup that wants to reimagine how desktop computers work using generative AI.
Martian’s tool automatically switches between LLMs to reduce costs.	With $9 million in funding, AI experts Etan Ginsberg and Shriyash Upadhyay founded the startup Martian. Martian developed a "model router" tool that intelligently routes requests to the most appropriate model for the job, maximizing the utilization of huge language AI models like GPT-4 and reducing costs. According to the founders, this strategy encourages fundamental research as opposed to competitive research.

Resources

Link	description
ziplora-pytorch.	Low-rank learning matrices, or LoRAs, alter model behavior at a lower cost than traditional fine-tuning. This paper suggests a practical method for combining LoRAs while preserving their unique information. original paper
DuckTrack: Accurate Computer Activity Tracking.	It can be a little difficult to extract image, audio, and keystroke data from your computer. This library's goal is to train digital agents by simplifying that procedure.
direct-preference-optimization.	Using extremely identical data, direct preference optimization is a reliable substitute for RLHF. An implementation of the approach can be studied in this repository to gain knowledge about it.
Kandinsky Video — a new text-to-video generation model.	This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model. code.
SD-T2I-360PanoImage.	A novel circular blending technique has been developed by researchers to solve the enduring problem of producing smooth 360-degree panoramic pictures. Their novel methods for creating panoramic panoramas from text and individual photos heavily rely on this methodology.
insanely-fast-whisper.	Transcribe 150 minutes (2.5 hours) of audio in less than 98 seconds.
Agency: The Go Way to AI.	Library designed for developers eager to explore the potential of Large Language Models (LLMs) and other generative AI through a clean, effective, and Go-idiomatic approach.
CoachLM.	CoachLM presents a cutting-edge AI method for improving training datasets for LLMs. This approach dramatically increases the efficacy of instruction-following in LLMs by improving datasets in a novel way—by modifying rather than eliminating low-quality samples.
multimodal-maestro.	Effective prompting for Large Multimodal Models like GPT-4 Vision or LLaVA.
Tanuki.	Easily build LLM-powered apps that get cheaper and faster over time.
Accelerating Generative AI with PyTorch II: GPT, Fast.	The PyTorch team describes in this blog post how to significantly accelerate language model inference using native Pytorch code. The article explains how to obtain more than 200 tokens from Llama 2 7B every second.
Qwen-Audio.	An audio understanding model that is universal has been released by the Alibaba Cloud Group. It is capable of a variety of audio-related tasks, such as text question answering and music comprehension and speaker recognition.
3D to Photo.	3D to Photo is an open-source package by Dabble, that combines threeJS and Stable diffusion to build a virtual photo studio for product photography. Load a 3D model into the browser and virtual shoot it in any kind of scene you can imagine. The app currently uses Stable Diffusion 1.5-inpainting, hosted on Replicate.
5 Ways To Leverage AI in Tech with Freshworks CIO Prasad Ramakrishnan.	Prasad Ramakrishnan, CIO of Freshworks, goes over a few practical applications of AI for startups. This article outlines five ways that businesses may utilize AI to grow and address challenges, from improving user experience to onboarding and optimizing data platforms.

Perspectives

Link	description
The Q* hypothesis: Tree-of-thoughts reasoning, process reward models, and supercharging synthetic data.	According to a recent leak, an internal breakthrough combining search and reinforcement learning was the reason behind the OpenAI leadership scandal. This article presents one notion that sheds light on what this new approach could be genuinely doing.
Inside OpenAI, a rift between billionaires and altruistic researchers unravelled over the future of artificial intelligence.	a comprehensive overview of the OpenAI disaster. Generations to come will study this. This synopsis, which covers all of the intriguing facets and consequences, is essential reading for anyone who hasn't followed the full story.
What is OpenAI, Really?.	It’s been five incredibly turbulent days at the leading AI tech company, with the exit and then return of CEO Sam Altman. As we dig into what went wrong, an even bigger question looms: what is OpenAI?
Why I Just Resigned From My Job In Generative AI.	Stability AI's Audio team leader left because of differences in the company's position on using copyrighted material to train generative AI models. The proponent of generative AI feels that it is not fair use to train models on copyrighted material without permission, as this could put them in competition with original creations.
Exploring the Growing Convergence Between Blockchain and AI.	This research, which surveyed more than 600 IT professionals worldwide, delves into the main causes of the increasing demand for AI, the most well-liked applications, and the main obstacles preventing its wider deployment. Interestingly, it refutes the idea that blockchain and artificial intelligence are incompatible with technology.
Reshaping the tree: rebuilding organizations for AI.	By automating tasks and decision-making, the integration of AI in enterprises is altering conventional work processes and empowering teams to operate more productively. Organizations must adjust as AI develops quickly by promoting team experimentation with the technology, getting ready for new developments, and moving quickly to maintain their competitive edge.
God Help Us, Let's Try To Understand AI Monosemanticity.	By replicating a larger AI within a smaller one, Anthropic researchers have devised a mechanism to comprehend the intricate workings of AI. What they have discovered is that AI neural networks are capable of encoding information in a sophisticated way, much like physics' superposition. Through the application of this method to a basic AI, they found that it could represent discrete concepts, such as "God," in discrete features. They further conjecture that the same methodology may yield deeper insights into artificial and biological neural systems, which could result in safer and more effective AI development.
A Data-Driven Look at the Rise of AI.	This article from the Cerebral Valley AI Summit examines the development of AI using data and is heavily illustrated with slides. We all hear it all the time, but there is a ton of evidence to support it. The development of developer interest and its eventual collapse are of particular importance.
How Much Does it Cost to Use an LLM?.	Different models cost different amounts. Also, the size of the context window is an important factor. But how much?
The 10-Year “Overnight” Success Story of Casetext.	When Casetext first launched in 2013, it was a crowdsourced legal library, similar to "Wikipedia meets Reddit" for legal matters. A decade later, Casetext stands as one of the greatest AI achievements to date, able to compress weeks' worth of laborious legal work into hours or minutes. It was purchased for $650 million just a few months ago. Between those two points, what transpired?
ChatGPT generates fake data set to support scientific hypothesis.	Researchers say that the model behind the chatbot fabricated a convincing bogus database, but a forensic examination shows it doesn’t pass for authentic.
What the OpenAI drama means for AI progress — and safety.	A debacle at the company that built ChatGPT highlights concern that commercial forces are acting against the responsible development of artificial-intelligence systems.
Adjust the format of papers to improve description by AI.	The chatbot ChatGPT and other tools based on large language models (LLMs) can make scientific research more efficient, but they can also introduce mistakes when they describe scientific work. I suggest that small changes to the format of scientific papers could improve the training of LLMs.
AI under the microscope: the algorithms powering the search for cells.	Deep learning is driving the rapid evolution of algorithms that can automatically find and trace cells in a wide range of microscopy experiments.
ChatGPT's training data can be exposed via a "divergence attack".	Large language models, like ChatGPT, have been shown to be able to memorize and unintentionally disclose particular training data, raising privacy concerns—especially with bigger models.
In The Age Of AI, Google Experiments With Bold Changes To Search.	The excitement surrounding Q*, the purported AI breakthrough from OpenAI, illustrates the tech community's propensity to quickly change focus and conjecture about the next great development in AI, frequently with no information—a phenomenon known as "Shiny Object Syndrome."
A global hit: AI translation tools help singers break down borders.	While some producers view the cost as prohibitive, YouTube, Mr. Beast, and a South Korean label are among the companies using AI to dub video content into many languages.
ChatGPT is winning the future — but what future is that?.	OpenAI didn’t mean to kickstart a generational shift in the technology industry. But it did. Now all we have to decide is where to go from here.

Back to index

ML news: Week 20-26 November

Research

Link	description
MILA: Memory-Based Instance-Level Adaptation for Cross-Domain Object Detection.	Cross-domain object detection is challenging, and it involves aligning labeled source and unlabeled target domains. we propose a memory-based instance-level domain adaptation framework. Our method aligns a target instance with the most similar source instance of the same category retrieved from memory storage. official code.
TopoMLP: An Simple yet Strong Pipeline for Driving Topology Reasoning.	TopoMLP is a system that detects and analyzes traffic features and road centerlines to comprehend road scenes and identify drivable courses for self-driving automobiles. official code.
Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model.	In this study, several data optimization strategies that need less computational overhead to enable knowledge transfer across models are examined.
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models.	StyleTTS 2 is a text-to-speech model that combines huge speech language models with adversarial training and style diffusion to produce human-level voice synthesis. official code.
Orca 2: Teaching Small Language Models How to Reason.	A few months ago, we introduced Orca, a 13-billion language model that demonstrated strong reasoning abilities by imitating the step-by-step reasoning traces of more capable LLMs.Orca 2 significantly surpasses models of similar size (including the original Orca model) and attains performance levels similar to or better than models 5-10 times larger, as assessed on complex tasks that test advanced reasoning abilities in zero-shot settings.
Proving Test Set Contamination in Black Box Language Models.	a thorough examination of the data that was utilized to train language models. Its findings imply that a large number of closed-source models most likely did not train on widely used benchmarks.
Amazon Reportedly Training AI With Twice As Many Parameters As GPT-4 .	The model will have a whopping 2 trillion parameters, which are the variables that determine the output of a given model, making it one of the largest currently in development.

News

Link	description
Discord is shutting down its AI chatbot Clyde.	Discord users won’t be able to chat to Clyde from December 1st onwards.
OpenAI has put ChatGPT Plus sign-ups on pause.	After announcing premium-tier users can build their own chatbots, CEO Sam Altman says its Plus subscription has exceeded capacity
OpenAI Staff Threatens Exodus, Jeopardizing Company’s Future.	A board member who was part of Sam Altman’s ouster as chief executive joined a majority of the company’s staff in calling for the decision’s reversal.
Sam Altman is still trying to return as OpenAI CEO.	Altman’s move to Microsoft isn’t a done deal, and Ilya Sutskever’s flip to supporting Altman means two board members need to change their minds.
Salesforce looks to poach outbound OpenAI staff with "full cash" compensation offer.	OpenAI researchers leaving the firm in protest could be offered a lifeline at Salesforce
Amazon’s offering free courses on generative AI.	From the company that brought you AWS certification comes a new ‘AI Ready’ education track to help train aspiring professionals on Amazon’s AI tech.
Eye On AI: Bain Capital Ventures Launches BCV Labs In Search Of New AI Deals.	BCV Labs is a new AI incubator and technical community founded by Bain Capital Ventures that provides money, office space, events, GPU credits, fellowship program, and recruiting help.
Microsoft rebrands its AI-powered Bing Chat as Copilot.	The company has also announced more Copilot AI features for its 365 apps.
Sam Altman to return as CEO of OpenAI.	After an attempted coup by OpenAI’s board that lasted five days, Altman is returning alongside co-founder Greg Brockman.
Microsoft and Nvidia are making it easier to run AI models on Windows.	Microsoft’s new Windows AI Studio lets developers access and configure AI models, such as Microsoft’s Phi, Meta’s Llama 2, and Mistral.
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding.	Automating autoregressive language model inference may be done in a variety of ways. One method that has generated excitement is the use of draft models. Although it may take longer, this needs two models. On the other hand, you may reduce the requirement for a draft model and accelerate creation linearly by producing related ngrams from the same model.
OpenAI drops a big new ChatGPT feature with a joke about its CEO drama.	ChatGPT’s voice feature lets you ask it a question by saying it aloud — and now it’s available for free.
Emmett Shear threatening to leave OpenAI if board can’t prove Sam Altman’s wrongdoing.	Former Twitch CEO Emmett Shear took a role at OpenAI following the ousting of Sam Altman but is reportedly threatening to leave unless the board can show evidence of Altman’s wrongdoing.
Artificial intelligence finds ways to develop new drugs.	A new AI model developed by chemists at ETH Zurich can not only predict where a pharmaceutically active molecule can be chemically modified but also how best to do it. This makes it possible to identify new pharmaceutical ingredients more quickly and improve existing ones in a targeted manner.
OpenAI researchers warned board of AI breakthrough ahead of CEO ouster, sources say.	Ahead of OpenAI CEO Sam Altman’s four days in exile, several staff researchers wrote a letter to the board of directors warning of a powerful artificial intelligence discovery that they said could threaten humanity

Resources

Link	description
Neural-Cherche.	Neural-Cherche is a library designed to fine-tune neural search models such as Splade, ColBERT, and SparseEmbed on a specific dataset.
The Data Engineering Handbook.	This repo has all the resources you need to become an amazing data engineer.
tensorli.	Absolute minimalistic implementation of a GPT-like transformer using only numpy (<650 lines).
THE RISE OF “WET” ARTIFICIAL INTELLIGENCE.	Combining AI with traditional wet lab work creates a virtuous circle from lab to data and back to the lab.
Video-LLaVA.	Video-LLaVA exhibits remarkable interactive capabilities between images and videos, despite the absence of image-video pairs in the dataset. It achieves state-of-the-art performance in video summarization and captioning.
make-real-starter.	Recently, tldraw released a popular tool that lets people quickly design software using a paint-like interface. GPT-V is then used to write code for the design's online version. It produces reliable and functional code and operates remarkably well. It also accepts commands in plain language.
AI Exploits.	A collection of real-world AI/ML exploits for responsibly disclosed vulnerabilities
Collaborative Word-based Pre-trained Item Representation for Transferable Recommendation.	The recently proposed CoWPiRec method enhances recommender systems using text-based item representations combined with collaborative filtering information. Using word graphs for item interactions, this novel approach has demonstrated better performance in a range of recommendation circumstances, including solving the cold-start issue.
RustGPT.	A web ChatGPT clone entirely crafted using Rust and HTMX.
Stable Video Diffusion Image-to-Video Model Card.	Stable Video Diffusion (SVD) Image-to-Video is a diffusion model that takes in a still image as a conditioning frame, and generates a video from it.
LangChain for Go.	Building applications with LLMs through composability, with Go
Reinforcement Learning for Generative AI: A Survey.	Comprehensive review across various application areas like NLP, computer vision, and more exciting and emerging domains. Insights into RL's flexibility in introducing new training approaches.Future directions for the evolution of generative AI.

Perspectives

Link	description
OpenAI’s identity crisis and the battle for AI’s future.	Last weekend some news happened in OpenAI, this blog post is about discussing some open questions.
A Data-Driven Look at the Rise of AI.	2023, The AI Revolution: Coatue's Sri Viswanath breaks down this year's developments in AI.
AI: The Coming Revolution.	Coatue highlight four points for the future: AI has potential to break through the hype and meaningfully improve our world. Open source is the heartbeat of AI, but not all open source is created equally. Builders and investors need to understand the new, AI-centric tech stack. The best of AI is yet to come
OpenAI’s Misalignment and Microsoft’s Gain.	After co-founders Sam Altman and Greg Brockman resigned from OpenAI due to internal issues and the company's failing non-profit strategy, Microsoft acquired key staff and intellectual property from OpenAI, significantly changing the AI field.
AGI's Impact on Tech, SaaS Valuations.	Thought experiments on how AGI affects SaaS companies of all shapes and sizes
Oops! We Automated Bullshit.	ChatGPT is a bullshit generator. To understand AI, we should think harder about bullshit
Explaining the SDXL latent space.	Using a smaller latent space for diffusion was one of the advances of the original Stable Diffusion model. This indicates that the diffusion occurs on a compressed image representation rather than on pixels. This article explores many interpretations of that space for SDXL.
Sudden Disturbances in Rapidly Moving Objects : The Implications of the OpenAI Fiasco.	The unexpected threat to OpenAI's dominating position in the developer ecosystem creates a chance for smaller businesses to step in and take advantage of a fresh opening. Microsoft will probably emerge victorious in the AI race, but it's possible that Anthropic and other model-layer businesses may capitalize on the disruption.
AI should focus on equity in pandemic preparedness.	Over-reliance on AI could inadvertently prioritize certain viruses or populations, leading to inequities in vaccine and disease research.
How AI is expanding art history.	From identifying disputed artworks to reconstructing lost masterpieces, artificial intelligence is enriching how we interpret our cultural heritage.
How AI shapes the life sciences: an interview with Oliver Stegle.	Oliver Stegle explains how AI-based tools have the potential to transform our ability to better understand the complexity of life and how these tools will shape the future of scientific exploration

Back to index

ML news: Week 12-19 November

Research

Link	description
3DStyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models.	In order to provide more control over appearance and geometry, this research integrates 2D diffusion models into the 3DStyle-Diffusion model, a revolutionary technique for comprehensive stylization of 3D meshes. It functions by first employing implicit MLP networks to parameterize the texture of a 3D mesh into reflectance and illumination. After that, a pre-trained 2D diffusion model is used to maintain geometric consistency and match the produced pictures with the text prompt. official code.
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Task.	Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism to enhances pre-trained audio-visual models for multi-modal tasks.
Generalized Biomolecular Modeling and Design with RoseTTAFold All-Atom.	RoseTTAFold All-Atom (RFAA), a deep network addressing the limitations of current protein structure modeling tools by accurately representing complete biological assemblies, including covalent modifications and interactions with small molecules. RFAA demonstrates comparable accuracy to AlphaFold2 in protein structure prediction, excels in flexible small molecule docking, and predicts covalent modifications and assemblies involving nucleic acids and small molecules. Additionally, the authors present RFdiffusion All-Atom (RFdiffusionAA), a fine-tuned model for generating binding pockets around small and non-protein molecules, showcasing experimental validation with proteins binding to therapeutic, enzymatic, and optically active molecules.
FinGPT: Large Generative Models for a Small Language.	This study tackles the challenges of creating large language models (LLMs) for Finnish, a language spoken by less than 0.1% of the world population.
Watermarking Vision-Language Pre-trained Models for Multi-modal Embedding as a Service.	VLPMarker, a secure and robust backdoor-based embedding watermarking method for vision-language pre-trained models (VLPs), which effectively injects triggers into VLPs without interfering with model parameters, providing high-quality copyright verification and minimal impact on performance, while also enhancing resilience against various attacks through a collaborative copyright verification strategy based on both backdoor triggers and embedding distribution.
Visualizing the Diversity of Representations Learned by Bayesian Neural Networks.	ExplainableAI methods and their applications to Bayesian Neural Networks
MonoDiffusion: Self-Supervised Monocular Depth Estimation Using Diffusion Model.	In this work, a novel framework for self-supervised monocular depth estimation called MonoDiffusion is presented. It takes a fresh approach to the problem by treating iterative denoising. Instead of employing real depth ground-truth for training, it makes use of a faux ground-truth diffusion process led by a teacher model that has already been taught. official code.
Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering.	The paper discusses the deployment challenges of large language models (LLMs) in real-world scenarios, particularly in domain-specific question answering (QA) with the integration of domain knowledge graphs. The authors introduce KnowPAT, a novel pipeline that employs style and knowledge preference sets, coupled with a new alignment objective, to improve LLMs for practical use in domain-specific QA, as evidenced by superior performance in experiments against 15 baseline methods. official code.
DeepMind AI accurately forecasts weather — on a desktop computer.	The machine-learning model takes less than a minute to predict future weather worldwide more precisely than other approaches. original article
Role play with large language models.	Casting dialogue-agent behaviour in terms of role play allows us to draw on familiar folk psychological terms, without ascribing human characteristics to language models that they in fact lack.
Fine-tuning Language Models for Factuality.	ChatGPT's widespread acceptance was made possible by a breakthrough in model optimization based on preferences. By using comparable technologies, model accuracy and factual accuracy can be increased, leading to a 50% reduction in medical recall errors.
Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO.	This group trained an ultra-small YOLO computer vision model and developed new RISC-V hardware specifically for vision, allowing for real-time object identification at very low latency and low power consumption.
SentAlign: Accurate and Scalable Sentence Alignment.	an accurate sentence alignment tool designed to handle very large parallel document pairs. It can efficiently handle tens of thousands of sentences. official code.
Large Language Models are Temporal and Causal Reasoners for Video Question Answering.	LLMs make errors in VQA when they focus too much on the language and ignore the video content, this article aims to solve this official code.

News

Link	description
Google in talks to invest ‘hundreds of millions’ into AI startup Character.AI.	Character.AI's chatbots, with various roles and tones to choose from, have appealed to users ages 18 to 24, who contributed about 60% of its website traffic.
Introducing AI to FigJam.	FigJam, Figma's digital whiteboard application, now incorporates AI support to help streamline and improve design interactions.
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills.	An open-source approach that integrates language and vision is called LLaVa. The updated version gives the instruction-tuned model access to tools for creating and altering images, among other things.
ai-town-rwkv-proxy.	Hundreds of agents in AI Town, an incredible experiment, go about their everyday lives as prompt states in language models. Compared to typical Transformers, the RWKV model is a linear language model that uses less resources. This repository runs AI Town on your local computer using this less expensive model.
Nvidia is launching a new must-have AI chip — as customers still scramble for its last one.	The new class-leading H200 has more memory capacity and bandwidth, speeding up its work with generative AI and LLMs.
OpenAI reveals new details about its AI development roadmap and fundraising plans.	OpenAI LP is working on GPT-5 and plans to raise more capital from Microsoft Corp. to support its development efforts, Chief Executive Officer Sam Altman has disclosed in a new interview.
Xbox partners with Inworld AI to build generative AI tools for game development.	Xbox and Inworld AI are working together to build AI-driven technologies that will enhance game developers' narratives and character creation features. As part of the collaboration, an AI character runtime engine and an AI design copilot will be created to help game creators create immersive gaming experiences. They believe these technologies will accelerate game creation, improve immersion, and encourage boundless innovation.
New techniques efficiently accelerate sparse tensors for massive AI models.	Complimentary approaches — “HighLight” and “Tailors and Swiftiles” — could boost the performance of demanding machine-learning tasks.
OpenAI’s six-member board will decide ‘when we’ve attained AGI’	According to OpenAI, the six members of its nonprofit board of directors will determine when the company has “attained AGI”
Giant AI Platform Introduces ‘Bounties’ for Deepfakes of Real People.	Users of the contentious "bounties" function of Civitai, an AI model sharing site, may now commission and profit from the production of AI-generated photographs.
You.com launches new APIs to connect LLMs to the web.	When OpenAI connected ChatGPT to the internet, it supercharged the AI chatbot’s capabilities. Now, the search engine You.com wants to do the same for every large language model (LLM) out there.
Microsoft and OpenAI partnership unveils new AI opportunities.	Microsoft said at OpenAI's DevDay that it will launch the new GPT-4 Turbo on Azure OpenAI Service before year's end, offering more control and cost savings. Businesses' AI skills will be enhanced by OpenAI's Custom Models initiative, which will easily interact with Microsoft's ecosystem.
Nous-Capybara-34B V1.9.	This is trained on the Yi-34B model with 200K context length, for 3 epochs on the Capybara dataset (multi-turn data with more than 1000 tokens per conversation)
AI writes summaries of preprints in bioRxiv trial.	Large language model creates synopses of papers aimed at various reading levels to help scientists sift through the literature.
Catch me if you can! How to beat GPT-4 with a 13B model.	Announcing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmark. What's the trick behind it? Well, rephrasing the test set is all you need!
IBM debuts $500 million enterprise AI venture fund	IBM is dedicating $500 million to invest in generative AI startups focused on business customers.
Microsoft is finally making custom chips — and they’re all about AI.	The Azure Maia 100 and Cobalt 100 chips are the first two custom silicon chips designed by Microsoft for its cloud infrastructure
Google's AI-powered search feature goes global with a 120-country expansion.	The SGE update includes additional language support for Spanish, Portuguese, Korean and Indonesian.
Universe 2023: Copilot transforms GitHub into the AI-powered developer platform.	GitHub is announcing general availability of GitHub Copilot Chat and previews of the new GitHub Copilot Enterprise offering, new AI-powered security features, and the GitHub Copilot Partner Program.
Deepmind’s animation gallery.	A variety of animations and artwork have been made available by Google's deepmind research department to help people comprehend various AI systems. The animations are visually stunning but also a little strange.
Deep mind announce music generation model.	Today, in partnership with YouTube, we’re announcing Google DeepMind’s Lyria, our most advanced AI music generation model to date. Any content published by our Lyria model will be watermarked with SynthID.
META introduces Emu Video and Emu Edit, our latest generative AI research milestones.	A generative model frequently produces an output image that isn't exactly what you were hoping for. It is really difficult to alter that image using the same model, though. Meta made a crucial discovery: editing capabilities can arise when all generations are treated as instructions. This is a really good improvement, especially when combined with the model architecture's newfound simplicity.
Microsoft launches a deepfakes creator at Ignite 2023 event.	One of the more unexpected products to launch out of the Microsoft Ignite 2023 event is a tool that can create a photorealistic avatar of a person and animate that avatar saying things that the person didn’t necessarily say.
YouTube will show labels on videos that use AI	YouTube is now requiring creators to mark videos that are made using AI, and the platform will show labels to viewers.
Sam Altman fired as CEO of OpenAI.	In a sudden move, Altman is leaving after the company’s board determined that he ‘was not consistently candid in his communications.’ President and co-founder Greg Brockman has also quit. apparently, they asked him to come back but he is now hired by Microsoft.
Google delays launch of AI model Gemini, a potential rival to OpenAI's GPT-4.	Google is delaying the launch of its new large language model called Gemini, a potential rival to AI models from Microsoft (MSFT)-backed OpenAI
The Escalating AI Arm Race: Inside the High-Stakes Talent Wars with OpenAI and Google.	OpenAI recruiters are pitching annual compensation packages around $5-10 million for senior researchers who jump ship depending on their role and expertise.
Meta disbanded its Responsible AI team.	A new report says Meta’s Responsible AI team is now working on other AI teams.

Resources

Link	description
The Alignment Handbook.	The HuggingFace's Alignment Handbook aims to fill that gap by providing the community with a series of robust training recipes that span the whole pipeline.
versatile_audio_super_resolution.	Pass your audio in, AudioSR will make it high fidelity!
tarsier.	Vision utilities for web interaction agents. A number of teams are working on creating agents that can interact with web items through vision thanks to the development of potent new vision models. A standard toolset is introduced by Tarsier (e.g., element tagging). Any vision system will work to help you navigate the website and take action. It also has browsing facilities for language models without eyesight.
Extra-fast Bark for generating long texts.	In this notebook, we'll show you how to generate very long texts very quickly using Bark, Flash Attention 2 and batching.
OpenGPTs.	This is an open source effort to create a similar experience to OpenAI's GPTs. It builds upon LangChain, LangServe and LangSmith.
Tamil-Llama: A Family of LLaMA-based LLMs focused on Tamil Language.	This repository contains the code and models for "Tamil-Llama", a project focused on enhancing the performance of language models for the Tamil language.
GPT4V-AD-Exploration.	In our report, we explore the revolutionary GPT-4V, a visionary in the field of autonomous driving.
BestGPTs.	Top ranked OpenAI GPTs
Hallucination Leaderboard.	This evaluates how often an LLM introduces hallucinations when summarizing a document.
draw-a-ui	This is an app that uses tldraw and the gpt-4-vision api to generate html based on a wireframe you draw.
AMBER: An Automated Multi-dimensional Benchmark for Multi-modal Hallucination Evaluation.	a new benchmark designed to assess and reduce hallucinations in Multi-modal Large Language Models (MLLMs)
instructor	Structured extraction in Python, powered by OpenAI's function calling api, designed for simplicity, transparency, and control.
GPU-Accelerated LLM on a $100 Orange Pi.	This post shows GPU-accelerated LLM running smoothly on an embedded device at a reasonable speed. Additionally, we are able to run a Llama-2 13b model at 1.5 tok/sec on a 16GB version of the Orange Pi 5+ under $150.
LLM Sherpa.	LLM Sherpa provides strategic APIs to accelerate large language model (LLM) use cases.
The Developer's Guide to Production-Grade LLM Apps.	dvanced Techniques for Maximizing LLM Performance
Accelerating Generative AI with PyTorch: Segment Anything, Fast.	This blog shows how to get META SAM 8x faster, just using PyTorch features: quantization, nested tensors and Triton
ai-exploits.	This repository, ai-exploits, is a collection of exploits and scanning templates for responsibly disclosed vulnerabilities affecting machine learning tools.
Music ControlNet.	ControlNet represented an innovative approach to providing image synthetics models with fine-grained control. There is now a model for music generation that is fairly similar and allows you to manage several aspects such as pitch and pronunciation.
GPT-4 Turbo Note Taker.	Fast and simple, Tactiq’s AI Note Taker with GPT-4 Turbo lets you turn your meetings into actionable notes - so that you're always taking the right action and getting more out of your meetings.
Chroma.	Chroma is a generative model for designing proteins programmatically.
A Survey on Language Models for Code.	gives a summary of LLMs for code, covering 500 relevant works, more than 30 assessment tasks, and more than 50 models.

Perspectives

Link	description
Adversarial Attacks on LLMs.	This blog post discusses the many new assaults that language model systems are facing. It has good details regarding several attack types as well as some successful mitigations that teams have discovered.
AI and Open Source in 2023.	A comprehensive review of the major developments in the AI research, industry, and open-source space that happened in 2023.
How investors see your start up?	A general partner at Angular Ventures divides the application concepts we are seeing into three major categories in an attempt to make sense of all the nascent AI firms. This exclusively examines application-layer businesses; it ignores model-layer companies.
retool's state of AI 2023.	Retool surveyed 1,500 tech workers
Language models can use steganography to hide their reasoning, study finds.	large language models (LLMs) can master “encoded reasoning,” a form of steganography. This intriguing phenomenon allows LLMs to subtly embed intermediate reasoning steps within their generated text in a way that is undecipherable to human readers.
Why teachers should explore ChatGPT’s potential — despite the risks.	Many students now use AI chatbots to help with their assignments. Educators need to study how to include these tools in teaching and learning — and minimize pitfalls.
The future is quantum: universities look to train engineers for an emerging industry.	With quantum technologies heading for the mainstream, undergraduate courses are preparing the workforce of the future.
The Future of Music: How Generative AI Is Transforming the Music Industry.	AI-generated music has the potential to become our primary source of music in the future and influence our listening preferences. This might mark the beginning of music's "Midjourney moment."
AI Doomers Are Finally Getting Some Long Overdue Blowback.	Now, those who predicted AI will bring about our collective extinction must reconsider their claims. The "AI doom" really mainly benefited the large players, and there are plenty of chances for the open source AI movements.
There's a model for democratizing AI.	The request for recommendations made by OpenAI on integrating democratic procedures in AI decision-making comes out as constrictive and prefers to handle delicate political matters without accepting accountability, which could limit the application and efficacy of democracy in AI governance.
Copilot is an Incumbent Business Model.	Though its ultimate disruptive potential rests in redesigning workflows, a challenge that might open substantially larger market opportunities, the Copilot AI business model improves current workflows for efficiency without generating new markets or upending lower ends.

Back to index

ML news: Week 6-12 November

Research

Link	description
RT-Sketch: Goal-Conditioned Imitation Learning from Hand-Drawn Sketches.	Hand-drawn sketches as a modality for goal specification in visual imitation learning. You sketch the robot execute, in other words, you can communicate with the robot with a sketch. here is the official article.
Cheating Depth: Enhancing 3D Surface Anomaly Detection via Depth Simulation	RGB-based surface anomaly detection methods have advanced significantly. However, certain surface anomalies remain practically invisible in RGB alone, necessitating the incorporation of 3D information. This new approach 3D data with RGB outperforms traditional methods for surface anomaly detection. official code.
Gaussian Mixture Solvers for Diffusion Models.	Recently, diffusion models have achieved great success in generative tasks. Gaussian mixture solvers improve the model both in speed and quality official code.
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.	This paper introduces PIXART-alpha, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators. The model uses three elements: T5 text encodings, cross attention, and a diffusion transformer
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch.	In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of homologous models without retraining or GPUs. official code.
An Efficient Self-Supervised Cross-View Training For Sentence Embedding	Cross-View Training (SCT) allows efficient sentence embedding with small language models official code.
A Systematic Review of Deep Graph Neural Networks: Challenges, Classification, Architectures, Applications & Potential Utility in Bioinformatics	Apart from presenting all existing GNN models, mathematical analysis and comparison of the variants of all types of GNN have been highlighted in this survey. Graph neural networks are investigated for their potential real-world applications in various fields, focusing on Bioinformatics.
How AI could lead to a better understanding of the brain	Early machine-learning systems were inspired by neural networks — now AI might allow neuroscientists to get to grips with the brain’s unique complexities.
How AI can help to save endangered species	Scientists are using artificial intelligence to fight biodiversity loss by analyzing vast amounts of data, monitoring ecosystems, and spotting trends over time.
Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models	An article from Google providing experimental evidence that the transformer (and therefore LLMs) cannot generalize beyond the training data. This is an indication that the transformer will be not the architecture leading us to artificial general intelligence (AGI)
RobustMat: Neural Diffusion for Street Landmark Patch Matching under Challenging Environments	For autonomous vehicles (AVs), visual perception techniques based on sensors like cameras play crucial roles in information acquisition and processing. In various computer perception tasks for AVs, it may be helpful to match landmark patches taken by an onboard camera with other landmark patches captured at a different time or saved in a street scene image database. The authors using spatial information and neural differential equations have created an approach to improve landmark matching. official code
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models.	Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity. This new approach is composed of two steps: preserve the static image's content and refine details and resolution.
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples.	We know that better data improves LLM training, here is a better way to clean your data. The authors have published the decontaminator tool here.
Hallucination in LLMs.	We begin with an innovative taxonomy of LLM hallucinations, then delve into the factors contributing to hallucinations. Subsequently, we present a comprehensive overview of hallucination detection methods and benchmarks
Simplifying Transformer Blocks.	Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks, and normalization layers. official code
LLaVA-Med: Large Language and Vision Assistant for BioMedicine	LLaVA-Med was initialized with the general-domain LLaVA and then continuously trained in a curriculum learning fashion (first biomedical concept alignment then full-blown instruction-tuning). We evaluated LLaVA-Med on standard visual conversation and question-answering tasks. official repository

News

Link	description
Google Research scholar program.	The Research Scholar Program aims to support early-career professors who are pursuing research in fields relevant to Google.
OpenAI DevDay Buzz Includes Alleged Leak Of New ChatGPT Prototype.	"highlights: OpenAI could introduce major updates for developers, making it cheaper and faster to build AI-based applications. A rumored "Team" plan for ChatGPT could offer unlimited high-speed GPT-4, advanced data analysis, and more."
Google is extending its Vulnerability Rewards Program (VRP) to include generative AI.	Today, we’re expanding our VRP to reward attack scenarios specific to generative AI. As part of expanding VRP for AI, we’re taking a fresh look at how bugs should be categorized and reported.
Paper Digest: NeurIPS 2023 Highlights.	Paper digest has analyzed more than 500/3500 papers. Interesting, but many articles are already been published for a while
HelixNet.	HelixNet is a Deep Learning architecture consisting of 3 x Mistral-7B LLMs. It has an actor, a critic, and a regenerator. The authors also used AI synthetic data. This approach showed impressive results. The model is available on HuggingFace
ChatGPT Plus members can upload and analyze files in the latest beta.	ChatGPT Plus members can also use modes like Browse with Bing without manually switching, letting the chatbot decide when to use them.
OpenAI Dev day recap.	A recap by OpenAI:New GPT-4 Turbo model that is more capable, cheaper and supports a 128K context window, New Assistants API that makes it easier for developers to build their own assistive AI apps. New multimodal capabilities in the platform, including vision, image creation (DALL·E 3), and text-to-speech (TTS)
xAI PromptIDE	Integrated development environment for prompt engineering and interpretability research, released by xAI
ChatGPT continues to be one of the fastest-growing services ever	In less than a year, it’s hit 100 million weekly users, and over 2 million developers are currently building on the company’s API, including the majority of Fortune 500 companies.
Xbox partners with Inworld AI to build AI tools for game development.	Microsoft’s Xbox and Inworld AI have partnered to create AI-powered game development tools for narrative and character creation.
Nvidia Is Piloting a Generative AI for Its Engineers.	ChipNeMo summarizes bug reports, gives advice, and writes design-tool scripts
YouTube to test generative AI features.	Users may test out a new conversational tool that utilizes artificial intelligence (AI) to respond to inquiries about YouTube content and provide suggestions, as well as a new feature that summarizes subjects in video comments, as part of the premium package offered to pay subscribers.
Google Announces Expansion of AI Partnership with Anthropic.	Partnership includes important new collaborations on AI safety standards, committing to the highest standards of AI security, and use of TPU v5e accelerators for AI inference
Cohere Introduced Embed v3	Embed v3 offers state-of-the-art performance per trusted MTEB and BEIR benchmarks. it is multilingual (100+ languages), works well with noisy data, retrieval-augmentation generation (RAG) systems, searches in a language or cross-language searches
Microsoft has over a million paying Github Copilot users	"We have over 1 million paid copilot users in more than 37,000 organizations that subscribe to copilot for business," said Nadella, "with significant traction outside the United States."
Meta's audiocraft can also generate stereo music	Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor/tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
Hugging Face has a two-person team developing ChatGPT-like AI models	Hugging Face's H4 team is focused on developing open-source ChatGPT
Samsung is joining the AI arms race, too	Samsung’s live translate feature, which the company is calling “AI Live Translate Call,” will be built into the company’s native phone app. Samsung says “audio and text translations will appear in real-time as you speak” and that the translations will happen on the device.
Introducing Adept Experiments	Adept is building AI agents and now they are opening access to test them
Introducing GPTs	You can now create custom versions of ChatGPT that combine instructions, extra knowledge, and any combination of skills. Highlight: Starting today, you can create GPTs and share them publicly. Later this month, we’re launching the GPT Store, featuring creations by verified builders. Once in the store, GPTs become searchable and may climb the leaderboards. We will also spotlight the most useful and delightful GPTs we come across in categories like productivity, education, and “just for fun”. In the coming months, you’ll also be able to earn money based on how many people are using your GPT.
Google Cloud demonstrates the world’s largest distributed training job for large language models across 50000+ TPU v5e chips	Google Cloud TPU Multislice Training was built from the ground up to address the challenges of distributed ML training in orchestration, compilation, and end-to-end optimization. We demonstrated the benefits of Cloud TPU Multislice Training with what we believe is the largest publicly disclosed LLM distributed training job in the world (in terms of the number of chips used for training) on a compute cluster of 50,944 Cloud TPU v5e chips on the JAX ML framework, utilizing both BF16 and INT8 quantized training.
OpenAI Data Partnerships.	We’re interested in large-scale datasets that reflect human society and that are not already easily accessible online to the public today. We can work with any modality, including text, images, audio, or video. We’re particularly looking for data that expresses human intention (e.g. long-form writing or conversations rather than disconnected snippets), across any language, topic, and format.

Resources

Link	description
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference.	new software for competing with vLLM and text-generation interfaces for the fast serving of language models.
qdrant.	Qdrant (read: quadrant) is a vector similarity search engine and vector database. It provides a production-ready service with a convenient API to store, search, and manage points—vectors with an additional payload Qdrant is tailored to extended filtering support.
Video2Music	Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model. official article.
Hacking Google Bard - From Prompt Injection to Data Exfiltration.	Great post that explains what are the novel risk with generative AI plugins
RedPajama-Data-v2.	a new version of the RedPajama dataset, with 30 trillion filtered and deduplicated tokens (100+ trillion raw) from 84 CommonCrawl dumps covering 5 languages, along with 40+ pre-computed data quality annotations that can be used for further filtering and weighting. A dataset bigger than the one used for GPT-4 and already preprocessed
LLM4Rec.	The proposed CLLM4Rec is the first recommender system that tightly combines the ID-based paradigm and LLM-based paradigm and leverages the advantages of both worlds.
consistencydecoder.	OpenAI has released an Improved decoding for stable diffusion vaes. Consistency decoder has reached the SOTA and it is nice they released also for stable diffusion
TopicGPT.	We introduce TopicGPT, a prompt-based framework that uses large language models (LLMs) to uncover latent topics within a provided text collection. TopicGPT produces topics that align better with human categorizations compared to competing methods. official article.
FACTOR.	an effective tool to detect deep fakes even without training. FACTOR leverages the discrepancy between false facts and their imperfect synthesis within deepfakes. By quantifying the similarity using the truth score, computed via cosine similarity, FACTOR effectively distinguishes between real and fake media, enabling robust detection of zero-day deepfake attacks.
CogVLM.	CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters.
langroid	Langroid is an intuitive, lightweight, extensible, and principled Python framework to easily build LLM-powered applications. You set up Agents, equip them with optional components (LLM, vector-store and methods), assign them tasks, and have them collaboratively solve a problem by exchanging messages.
OVIR-3D.	D object retrieval from text prompts using 2D image fusion. his work provides a straightforward yet effective solution for open-vocabulary 3D instance retrieval, which returns a ranked set of 3D instance segments given a 3D point cloud reconstructed from an RGB-D video and a language query.
JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models.	an automatic evaluation metric called JaSPICE, which evaluates Japanese captions based on scene graphs. There is a gap between performance of models for english captioning and other languages, this clever approach promises to reduce the gap
awesome-openai-vision-api-experiments.	A set of examples showing how to use the OpenAI vision API to run inference on images, video files and webcam streams.
punica.	Low rank adapation (LoRA) is a parameter efficient way to add new knowledge to a pretrained LLM. Although the pretrained LLM takes 100s of GB storage, a LoRA finetuned model only adds 1% storage and memory overhead. Punica enables running multiple LoRA finetuned models at the cost of running one.
LongQLoRA.	LongQLoRA is a memory-efficient and effective method to extend context length of Large Language Models with less training GPUs. On a single 32GB V100 GPU, LongQLoRA can extend the context length of LLaMA2 7B and 13B from 4096 to 8192 and even to 12k.
Lidar-Annotation-is-All-You-Need.	a smarter method for self-driving cars to recognize roads by using lidar technology.
LM4VisualEncoding.	Pretrained transformers from LLMs, despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Our exploration shows the potential of LLMs as general-purpose encoders for visual data, as opposed to the previous usages of either pure encoders for text embeddings or decoders for tokenized outputs. official article.
vimGPT.	Browse the web with GPT-4V and Vimium. Vimium is a Chrome extension that lets you navigate the web with only your keyboard. You could use Vimium to give the model a way to interact with the web.
Announcing a New Way to Create AI Employees	the first platform letting you build a team of AI employees working together to perform any task. The idea is to build an agent that you can call and ask to perform a task

Perspectives

Link	description
Data Pipeline Attacks.	An excerpt from Secure Intelligent Machines. In the future attacks will be focused on poisoning data or other components of the data pipeline. This blog post describes this issue and potential mitigation issues
Could Cruise be the Theranos of AI? And is there a dark secret at the core of the entire driverless car industry?	Cruise is a driverless car company bought by General Motors. However, it seems that remote human interventions is needed in many cases
Will generative AI transform business?	Industries expect demand for quality control and human oversight of AI-generated content to grow
A minor ChatGPT update is a warning to founders: Big Tech can blow up your startup at any time.	Wrapping chatGPT as a core business is not a great idea. chatGPT can now interact with PDF and let you ask questions which is blowing the business of small start-ups. It's a bleak reminder that swift rule changes by Big Tech firms can wreak havoc on smaller players.
Pixel Perfect: How AI Unlocks Creativity.	AI, and creators are gaining momentum. Using the right tactics can increase it
Almost an Agent: What GPTs can do.	GPT is almost an agent, but what actually an agent can do? For instance, write a scientific article by itself
Are language models good at making predictions?	It seems so. The article suggests GPT-4 really is better at making predictions for politics than for science or technology, even once the hardness of the questions are accounted for.
OpenAI Is A Lot More Vulnerable Than You Think.	All the press, money, and awards in the world won’t prevent OpenAI from the cold reality of competition.
ChatGPT use shows that the grant-application system is broken.	The fact that artificial intelligence can do much of the work makes a mockery of the process. It’s time to make it easier for scientists to ask for research funding.
The world’s week on AI safety: powerful computing efforts launched to boost research.	UK and US governments establish efforts to democratize access to supercomputers that will aid studies on AI systems.
Is AI the Next Crypto? Insights from 2M HN comments.	Both crypto and AI have been heavily debated on Hacker News, with discussions going back years. By looking at trends in HN commenter opinions we might find interesting similarities and differences.
AI companies have all kinds of arguments against paying for copyrighted content.	The biggest companies in AI aren’t interested in paying to use copyrighted material as training data.
AI could cause ‘catastrophic’ financial crisis, says Yuval Noah Harari	Historian and Sapiens author says sophistication of technology makes it difficult to forecast its dangers
Nvidia Envy: understanding the GPU gold rush.	In 2023, thousands of companies and countries begged Nvidia to purchase more GPUs. Can the exponential demand endure?
AI is about to completely change how you use computers.	Bill Gates in his blog (yes, he has a blog) discuss how AI will revolutionize software interaction
Self Supervised Learning Market Size Thrives with AI Systems That Discover Patterns and Insights Independently	Self Supervised Learning market growth surges due to AI's ability to autonomously learn from unlabelled data, enhancing efficiency and innovation
Yoko Taro Foresees the End of Video Games as We Know Them	Yoko Taro says the rise of AI will give birth to a new era of video games in which the line between developer and player is blurred into nonexistence.
How Generative AI Will Transform Knowledge Work	Generative AI can be a boon for knowledge work, but only if you use it in the right way. New generative AI-enabled tools are rapidly emerging to assist and transform knowledge work in industries ranging from education and finance to law and medicine.

Back to index

ML news: Week 30 October - 5 November

Research

Link	description
An Emulator for Fine-Tuning Large Language Models using Small Language Models	What would happen if we combined the knowledge learned by a large model during pre-training with the knowledge learned by a small model during fine-tuning (or vice versa)? Our experiments with EFT show that scaling up fine-tuning tends to improve helpfulness while scaling up pre-training tends to improve factuality.
Nearest Neighbor Guidance for Out-of-Distribution Detection	Detecting out-of-distribution (OOD) or unfamiliar data samples is crucial for machine learning models deployed in open-world environments. NNguide can help the model in this setting, especially in identifying unknown data. Code for the benchmark,Code for the method
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning	Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
AlphaFold update	AlphaFold’s update by Isomorphic (a spin-off from Google). A more powerful model that expands coverage beyond proteins. Other related information: Comment by DeepMind, official article
Mask Propagation for Efficient Video Semantic Segmentation	a method for segmenting video content that reduces computational load by focusing on keyframes and then predicting masks
Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V	how well GPT-4 with Vision (GPT-4V) answers questions related to medical images? This study analyzes exactly this offcial code
Learning From Mistakes Makes LLM Better Reasoner	Consider a human student who failed to solve a math problem, he will learn from what mistake he has made and how to correct it. Mimicking this error-driven learning process, LeMa fine-tunes LLMs on mistake-correction data pairs generated by GPT-4. Analysis of the article
AI ‘breakthrough’: neural net has human-like ability to generalize	Systematic generalization is demonstrated by people’s ability to effortlessly use newly acquired words in new settings. official article
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks	This article benchmarks different pre-trained models on different computer vision tasks official code, analysis of the article
The Foundation Model Transparency Index	Stanford measured how transparent companies are true their Large Language Models (LLMs) and other foundation models. The results? there is a lot to improve. deep dive
SoulChat: Improving LLMs' Empathy, Listening, and Comfort Abilities through Fine-tuning with Multi-turn Empathy Conversations	Researchers developed a new method to improve empathy capabilities of large language models. This is can be very important for psychological counseling or medical application official code
Towards Foundation Models for Knowledge Graph Reasoning	A foundation model for knowledge graphs which was actually missingblog post from the authors
Evaluating Large Language Models: A Comprehensive Survey	A comprehensive overview about the evaluation of LLMs
Deep Learning for Day Forecasts from Sparse Observations	a state-of-the-art neural weather model; MetNet-3 makes predictions up to 24 hours ahead for precipitation, wind, temperature, and dew point.

News

Link	description
Google commits to invest $2 billion in OpenAI competitor Anthropic	Google agreed to invest up to $2 billion in Anthropic, the artificial intelligence startup founded by ex-OpenAI executives, CNBC has confirmed.
Amazon rolls out AI-powered image generation	Amazon Ads has introduced an AI-powered image generation feature in beta. Without technical skills, brands can now create more engaging ads
Multi-modal prompt injection image attacks against GPT-4V	Multi-modal prompt injection image attacks against GPT-4V. GPT4-V is the new mode of GPT-4 that allows you to upload images as part of your conversations. It’s absolutely brilliant. It also provides a whole new set of vectors for prompt injection attacks.
Biden releases AI executive order directing agencies to develop safety guidelines	The executive order builds on non-binding agreements the White House made with AI companies.
A group behind Stable Diffusion wants to open source emotion-detecting AI	The group wants to open source the Empathic project. This in order to improve AI-human interaction
Kuo: Apple Could Spend $4.75 Billion on AI Servers in 2024	Apple is expected to spend several billion on hardware to support its artificial intelligence development in 2024. Tim Cook has commented that they are spending quite a bit of money on AI (more details here)
Artists Lose First Round of Copyright Infringement Case Against AI Art Generators	While a federal judge advanced an infringement claim against Stability AI, he dismissed the rest of the lawsuit.
Hackers Are Weaponizing AI to Improve a Favorite Attack	Phishing attacks are already devastatingly successful. What happens when artificial intelligence makes them even harder to spot?
Chinese tech giant Alibaba launches upgraded AI model to challenge Microsoft, Amazon	Alibaba on Tuesday launched the latest version of its artificial intelligence model (Tongyi Qianwen 2.0, its latest large language model), as the Chinese technology giant looks to compete with U.S. rivals like Amazon and Microsoft.
Microsoft pushes the boundaries of small AI models with big breakthrough	Microsoft researchers shared that the model, Phi 1.5, is now “multimodal,” meaning it can view and interpret images. Phi 1.5 is open source.
New techniques efficiently accelerate sparse tensors for massive AI models	Researchers from MIT and NVIDIA have developed two techniques that accelerate the processing of sparse tensors, a type of data structure that’s used for high-performance computing tasks. The complementary techniques could result in significant improvements to the performance and energy efficiency of systems like the massive machine-learning models that drive generative artificial intelligence.
Stability AI’s latest tool uses AI to generate 3D models	Stability AI, the startup behind the text-to-image AI model Stable Diffusion, thinks 3D model creation tools could be the next big thing in generative AI.
UK invests $273 million in AI supercomputer as it seeks to compete with U.S., China	The U.K. government said Wednesday that it will invest £225 million, or $273 million, into an AI supercomputer, highlighting the country’s ambition to lead in the technology as it races to catch up to the U.S. and China.
The Beatles Just Released Their Final Song With The Help Of AI	More than 50 years after their breakup, The Beatles have released their final song — and used AI to bring John Lennon's voice back to life.
Elon Musk's first AI product is a chatbot named Grok	Elon Musk's first AI product is here, and it's a chatbot called Grok — not to be confused with rizzed-up Baby Gronk.

Resources

Link	description
Audioflare	An all-in-one AI audio playground using Cloudflare AI Workers to transcribe, analyze, summarize, and translate any audio file.
JudgeLM: Fine-tuned Large Language Models are Scalable Judges	JudgeLM is an open platform for training, serving, and evaluating scalable large language model
Deep learning in Rust	Rust is a popular language and Burn is a framework for using ML in Rust. Now, you have a free book to learn about burn in rust.
LLM Collection	a collection and summary of notable and foundational LLMs
Leveraging Embeddings and Clustering Techniques in Computer Vision	How to use CLIP to cluster images
Training LLMs at Scale with AMD MI250 GPUs	Everyone uses NVIDIA, this post discusses how to train a LLM with AMD GPU
ICTC: Image Clustering Conditioned on Text Criteria	New methodology for performing image clustering based on user-specified criteria in the form of text paper
Insanely Fast Whisper	Transcribe 300 minutes (5 hours) of audio in less than 10 minutes - with OpenAI's Whisper Large v2.
magentic	Easily integrate Large Language Models into your Python code. Simply use the @prompt decorator to create functions that return structured output from the LLM. Mix LLM queries and function calling with regular Python code to create complex logic.
PUCA: Patch-Unshuffle and Channel Attention for Enhanced Self-Supervised Image Denoising	a new self-supervised denoising approach with incredible performances
LangChain Templates	LangChain Templates are the easiest and fastest way to build a production-ready LLM application. These templates serve as a set of reference architectures for a wide variety of popular LLM use cases.
how-to guide for LLaMA	META has released a guide on how to get started with LLaMA
Fine-tuning Mistral on your own data	In this notebook and tutorial, we will fine-tune the Mistral 7B model with just 1 dollar
Amazon release Mistral 7B with longer context window	Amazon has used RoPE to extend the model context length to 32K. However, there is already a Mistral version with 128K (by Nous using the Yarn method) which you can find here
Tiger toolkit	open-source resource for developers to create AI models and language applications tailored to their needs.
parameter-efficient-MOE	Cohere has released the code base for training an efficient mixture of experts (MOE)
ChatGPT-Powered Hierarchical Comparisons for Image Classification	Conventional image classification approaches typically evaluate their performance on the same set of categories as their training data. However, this evaluation paradigm fails to capture the challenges in real-world scenarios, where classes in the test set are not overlapped with the training set. For this reason here is a simple method using ChatGPT to create hierarchical classes official code
talk-llama	Talk with an LLaMA AI in your terminal
What's In My Big Data?	WIMBD platform analyzes content in text corpora, revealing duplicates, low-quality content, PII, toxicity, and benchmark contamination. code will be released here

Perspectives

Link	description
Thanks to AI, the future of programming may involve YELLING IN ALL CAPS	Politeness and emphasis play a surprising role in AI-model communications. Some OpenAI internal prompts are leaked, showing that using caps-lock for important words and adding please is a surprisingly efficient technique
Is AI alignment on track? Is it progressing... too fast?	We do not have concrete benchmarks about alignment, this is feeding a narrative of fear and doom. but it is true? Without serious study, we cannot know, this blog post discusses it in detail
The White House Is Preparing for an AI-Dominated Future	The Atlantic perspective on the new bill: "President Biden’s big swing on AI is as impressive and confusing as the technology itself."
The Half-Life of the AI Stack	The infrastructure layer in AI is rapidly changing
Ilya Sutskever, OpenAI’s chief scientist, on his hopes and fears for the future of AI	Interviewer to one of the most famous AI researcher
How Amazon and Berkshire got too big	a perspective about threats the business growth
Seismic Waves of Gen Z Behavior	A perspective on how Generation Z is changing the industries and the market.
Andrew NG warns big tech mount on AI fear to stop competition	A leading AI expert and Google Brain co-founder said Big Tech companies were stoking fears about the technology's risks to shut down competition. Yann LeCun is also discussing the same here
Biden’s AI Order May Have Wide Impact For Startups	The new order can have a deep impact for start-up
What AI means for your product strategy	1 hour podcast about how AI will impact product strategy
4 Ways AI Is Changing Marketing	How can AI be harnessed to drive more effective and efficient marketing? Forbes is discussing this
Sifting Through the Noise	We are in the age of information overload and soon we can be flooded with AI-generated content, how we survive?
How AI detectors can destroy innocent writers' livelihoods	The massive false positive rate of general AI detectors had a devastating effect on freelance writer Michael Berben: being falsely accused of cheating, he lost his job.

Back to index

ML news: Week 23-29 October

Research

Link	description
Geographical erasure in language generation	LLMs encode a vast amount of knowledge but it is not representative of all countries, Amazon shows how to mitigate this unbalance
Entangled Preferences: The History and Risks of Reinforcement Learning and Human Feedback	A deep dive in the history of RLHF, potential issues and suggestions for new lines of research
AgentTuning: Enabling Generalized Agent Abilities for LLMs	Open-source models are inferior as AI agents when you need them as efficient controllers for complex tasks. This paper highlights how to create efficient agent LLaMA-2 models
The Foundation Model Transparency Index	Stanford's new index rates the transparency of 10 foundation model companies and finds them lacking. The new index analyses 100 parameters, showing there is room for improvements
BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues	evaluation of the ability of large language models (LLMs) to engage in human-like multi-turn conversations.
SALMONN: Towards Generic Hearing Abilities for Large Language Models	SALMONN understands text and audio at the same time, and can be used for speech recognition and speech translation. official code
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling	While you can generate easily an image with diffusion creating a video is much more complex (consistency), this work allows generations up to 512 frames long paper, code
PDFTriage: Question Answering over Long, Structured Documents	Finding information from PDFs (web pages or other multi-page structured documents) is more difficult than for regular text. Therefore researchers at Adobe Research have developed a model that is able to consider both the text and the structure of the document
VidChapters-7M: Video Chapters at Scale	Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. Here the authors collected VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.
RLMRec: Representation Learning with Large Language Models for Recommendation	In this article the authors enhanced a recommendation system with an LLM, resulting in better recommendations. code here
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images	We assemble a dataset of Creative-Commons-licensed (CC) images, which we use to train a set of open diffusion models that are qualitatively competitive with Stable Diffusion 2 (SD2). official code
LLM-FP4: 4-Bit Floating-Point Quantized Transformers	We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner. Existing post-training quantization (PTQ) solutions are primarily integer-based and struggle with bit widths below 8 bits.official code
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time	For a specific input, only a small fraction of attention heads and MLP neurons are needed, while the rest can be "silenced" without changing the output. Deja Vu to speed up inference for large language models. exploiting "contextual sparsity" (finding small subsets of model parameters that are sufficient to compute the same output for a given input.). This is unlike prior pruning methods that permanently remove parameters. official code
ConvNets Match Vision Transformers at Scale	Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to datasets on the web scale. The authors invested the same computer budget on a CNN to make a fair comparison with the vision transformers and they matched the performance
Llemma: An Open Language Model For Mathematics	a large language model for mathematics, the authors show how using a small model in continuous pretraining you can beat bigger models on Math and STEM. deep dive
Zephyr: Direct Distillation of LM Alignment	a 7B parameter model with competitive performance to ChatGPT on AlpacaEval

News

Link	description
New Nvidia AI agent, powered by GPT-4, can train robots	Eureka, a new AI agent (powered by GPT-4) can teach complex skills to robots
‘Mind-blowing’ IBM chip speeds up AI	IBM has developed a brain-inspired computer chip that could supercharge artificial intelligence (AI) by working faster with much less power
“Math is hard” — if you are an LLM – and why that matters	LLM success on math is still limited, especially if you just rely on a LLM
Apple Rumored to Follow ChatGPT With Generative AI Features on iPhone as Soon as iOS 18	Apple plans to start implementing generative AI technology on the iPhone and iPad in late 2024 at the earliest according to analysts
Reddit can survive without search	Reddit and other companies may stop crawlers (and be not find anymore on google search) if they do not find an agreement in generative AI
This new data poisoning tool lets artists fight back against generative AI	A new tool lets artists add invisible changes to the pixels in their art before they upload it online so that if it’s scraped into an AI training set, it can cause the resulting model to break in chaotic and unpredictable ways.
AI risk must be treated as seriously as climate crisis, says Google DeepMind chief	Demis Hassabis calls for greater regulation to quell existential fears over tech with above-human levels of intelligence
Claude accessibility is expanded to 95 countries
IBM Presents NorthPole	a new chip much faster for AI and much more energy efficient
Perplexity raises new funding at $500 million valuation	Perplexity is developing an AI-powered search engine competing with the likes of OpenAI’s ChatGPT and Google’s Bard. According to recent reports, Perplexity has been generating annual recurring revenue of $3 million as of this month.
AI rapidly diagnoses brain tumours during surgery	A machine-learning method to assess DNA can accurately classify brain tumours in real time. This rapid analysis might help surgeons to identify the tumour type when operating and to adjust their surgical strategy accordingly.
AI executive order on October 30	The Biden Administration is reportedly set to unveil a broad executive order on artificial intelligence next week.
Lenovo and NVIDIA Announce Hybrid AI Solutions to Help Enterprises Quickly Adopt GenAI	New End-to-End Solutions Include Accelerated Systems, AI Software and Expert Services to Build and Deploy Domain-Specific AI Models with Ease

Resources

Link	description
caption-usampling	DALL-3 power is derived from better data quality, this library can allow you to upsample your dataset
SolidGPT	Chat everything with your code repository, ask repository-level code questions, and discuss your requirements. AI Scan and learning your code repository, provide you code repository level answer
GoLLIE 34B	zero-shot Information Extraction model for extracting information from unstructured data (CSV, JSON, and so on)
Arithmo-Mistral-7B	Mistral 7B fine-tuned on math
GraphMaker	a diffusion model capable of generating highly realisitc large attributed graphs. original article
Meta’s Habitat 3.0 simulates real-world environments for intelligent AI robot training	Researchers from Meta Platforms Inc.’s Fundamental Artificial Intelligence Research team said today they’re releasing a more advanced version of the AI simulation environment Habitat, which is used to teach robots how to interact with the physical world.
SAM-Med3D	the most comprehensive study to modify SAM for 3D medical images. Curated the most extensive volumetric medical dataset to date for training, boasting 131K 3D masks and 247 categories. paper
deepsparse	DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference.
ExecuTorch	PyTorch Edge: Enabling On-Device Inference Across Mobile and Edge Devices with ExecuTorch
Spelltest: AI-to-AI Testing for LLM Based Applications	Today's AI-driven applications largely depend on Large Language Models (LLMs) like GPT-4 to deliver innovative solutions. However, ensuring that they provide relevant and accurate responses in every situation is a challenge. Spelltest addresses this by simulating LLM responses using synthetic user personas and an evaluation technique to evaluate these responses automatically(but still requires human supervision).
polyfire-js	An all-in-one managed backend for AI apps. Build AI apps from the frontend, very fast
ToRA: A Tool-Integrated Reasoning Agent	ToRA is a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical reasoning problems by interacting with tools, e.g., computation libraries and symbolic solvers. ToRA series seamlessly integrates natural language reasoning with the utilization of external tools, thereby amalgamating the analytical prowess of language and the computational efficiency of external tools.
Adala	Adala offers a robust framework for implementing agents specialized in data processing, with an emphasis on diverse data labeling tasks.

Perspectives

Link	description
Emotional labor and its consequences	Emotional labor is what differentiate us from AI
The Techno-Optimist Manifesto	A blog post that has ignited a strong debate in Silicon Valley about the positive impact of technology
Peak Data	a blog post discussing what will happen if the internet is filled only with AI-generated data, this will lead probably to the collapse of AI model trained on these data
Five Areas of AI Opportunity According to Snowflake’s Ahmad Khan	Lightspeed recently hosted the latest in its Generative AI series in Los Angeles, a fireside chat with Ahmad Khan, Head of AI/ML Strategy at Snowflake
An AI revolution is brewing in medicine. What will it look like?	Emerging generalist models could overcome some limitations of first-generation machine-learning tools for clinical use.
The Convergence of Data & Software Engineering in the Age of AI	This convergence signals how far data teams have evolved into core engineering teams. Machine learning’s demand for data has accelerated this movement because AI needs data to function.
Managing AI Risks in an Era of Rapid Progress	Some of the biggest names in the field (Hinton, Bengio and so on) discuss the potential threats of AI and how to manage them

Back to index

Related Projects

intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques fo...

11 Nov 2022 1,909

T_Learns_Python

My python journey

12 Aug 2023 1

Machine-Learning-Guide

Machine learning Guide. Learn all about Machine Learning Tools, Libraries, Frameworks, Large Lang...

17 Oct 2020 442

Neuromorphic-Computing-Guide

Learn about the Neumorphic engineering process of creating large-scale integration (VLSI) systems...

03 Oct 2021 191

tutorial

Tutorials on machine learning, artificial intelligence, data science with math explanation and re...

19 May 2021 155

awesome-ai-agents

Awesome list of 300+ agentic AI resources

15 Jan 2024 212

ML-news-of-the-week

ML & AI news of the week

Suggestions and corrections

Index

2024

2023

2024

ML news: Week 14 - 20 October

Research

News

Resources

Perspectives

ML news: Week 7 - 13 October

Research

News

Resources

Perspectives

ML news: Week 30 September - 6 October

Research

News

Resources

Perspectives

ML news: Week 23 - 29 September

Research

News

Resources

Perspectives

ML news: Week 16 - 22 September

Research

News

Resources

Perspectives

ML news: Week 9 - 15 September

Research

News

Resources

Perspectives

ML news: Week 2 - 8 September

Research

News

Resources

Perspectives

ML news: Week 26 August - 1 September

Research

News

Resources

Perspectives

ML news: Week 19 - 25 August

Research

News

Resources

Perspectives

ML news: Week 12 - 18 August

Research

News

Resources

Perspectives

ML news: Week 5 - 11 August

Research

News

Resources

Perspectives

ML news: Week 29 July - 4 August

Research

News

Resources

Perspectives

ML news: Week 21 - 28 July

Research

News

Resources

Perspectives

ML news: Week 15 - 21 July

Research

News

Resources

Perspectives

ML news: Week 8 - 14 July

Research

News