Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
Huang, Yufeng; Tang, Jiji; Chen, Zhuo; Zhang, Rongsheng; Zhang, Xinfeng; Chen, Weijie; Zhao, Zeng; Zhao, Zhou; Lv, Tangjie; Hu, Zhipeng; Zhang, Wen
doi: 10.48550/arxiv.2305.06152
pmid: N/A
Abstract: Large-scale vision-language pre-training has achieved significant performance in multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. As illustrated in panel (a) of the paper's case figure, the models cannot distinguish between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning representations in multi-modal scenarios. In this paper, we present an end-to-end framework, Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. First, we use scene graphs to guide the construction of semantic negative examples, which increases the emphasis on learning structured representations. Moreover, a Knowledge-Enhance Encoder (KEE) is proposed to leverage SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with the aforementioned approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on the VG-Attribution and VG-Relation datasets, outperforming the multi-modal SOTA model by 12.5% and 4.1%, respectively. Meanwhile, the results on MSCOCO indicate that Structure-CLIP significantly enhances structured representations while maintaining the ability of general representations. Our code is publicly available.
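Note: the swap behind the astronaut/horse example can be sketched in a few lines. This is a minimal illustration, assuming the (subject, relation, object) triple has already been extracted from the caption; the abstract does not specify the scene graph parser or the exact negative-construction rules, and the names `Triple`, `caption`, and `swapped_negative` are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One scene-graph relation: subject --relation--> object."""
    subject: str
    relation: str
    obj: str


def article(noun: str) -> str:
    """Pick 'a' or 'an' with a simple first-letter rule (illustrative only)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def caption(t: Triple) -> str:
    """Render a triple as a short sentence with a capitalised first letter."""
    text = f"{article(t.subject)} {t.subject} {t.relation} {article(t.obj)} {t.obj}"
    return text[0].upper() + text[1:]


def swapped_negative(t: Triple) -> Triple:
    """Build a semantic negative: same words, subject and object exchanged."""
    return Triple(subject=t.obj, relation=t.relation, obj=t.subject)


# Hand-written triple standing in for a parsed scene graph (assumption).
positive = Triple("astronaut", "rides", "horse")
negative = swapped_negative(positive)

print(caption(positive))
print(caption(negative))
```

The negative caption contains the same words as the positive one, so a model can only tell the two apart by representing who does what to whom.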
Integrating Generative Artificial Intelligence in Intelligent Vehicle Systems
Stappen, Lukas; Dillmann, Jeremy; Striegel, Serena; Vögel, Hans-Jörg; Flores-Herr, Nicolas; Schuller, Björn W.
doi: 10.48550/arxiv.2305.17137
pmid: N/A
Abstract: This paper aims to serve as a comprehensive guide for researchers and practitioners, offering insights into the current state, potential applications, and future research directions for generative artificial intelligence and foundation models within the context of intelligent vehicles. As the automotive industry progressively integrates AI, generative artificial intelligence technologies hold the potential to revolutionize user interactions, delivering more immersive, intuitive, and personalised in-car experiences. We provide an overview of current applications of generative artificial intelligence in the automotive domain, emphasizing speech, audio, vision, and multimodal interactions. We then outline critical future research areas, including domain adaptability, alignment, and multimodal integration, among others, and address the challenges and risks associated with ethics. By fostering collaboration and addressing these research areas, generative artificial intelligence can unlock its full potential, transforming the driving experience and shaping the future of intelligent vehicles.
AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation
Wu, Tong; Fan, Zhihao; Liu, Xiao; Gong, Yeyun; Shen, Yelong; Jiao, Jian; Zheng, Hai-Tao; Li, Juntao; Wei, Zhongyu; Guo, Jian; Duan, Nan; Chen, Weizhu
doi: 10.48550/arxiv.2305.09515
pmid: N/A
Abstract: Diffusion models have gained significant attention in the realm of image generation due to their exceptional performance. Their success has recently been extended to text generation by generating all tokens in a sequence concurrently. However, natural language exhibits a far more pronounced sequential dependency than images, and the majority of existing language models are trained with a left-to-right auto-regressive approach. To account for the inherent sequential characteristic of natural language, we introduce Auto-Regressive Diffusion (AR-Diffusion). AR-Diffusion ensures that the generation of tokens on the right depends on the generated ones on the left, a mechanism achieved by employing a dynamic number of denoising steps that varies with token position. This results in tokens on the left undergoing fewer denoising steps than those on the right, thereby enabling them to be generated earlier and subsequently influence the generation of tokens on the right. In a series of experiments on various text generation tasks, including text summarization, machine translation, and common sense generation, AR-Diffusion clearly outperformed existing diffusion language models and can be $100\times$ to $600\times$ faster when achieving comparable results. Our code is publicly available.
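Note: the idea of a position-dependent number of denoising steps can be illustrated with a toy schedule. The linear lag below is an assumption chosen for readability, not the schedule used in the paper, and `token_timestep`, `steps_until_clean`, `max_t`, and `lag` are hypothetical names.

```python
def token_timestep(step: int, position: int, max_t: int, lag: int) -> int:
    """Noise level of one token at a given global sampling step.

    Every token starts at max_t (pure noise). The token at `position`
    only begins to be denoised after position * lag global steps, so
    tokens further to the right stay noisy for longer.
    """
    return min(max_t, max(0, max_t - step + position * lag))


def steps_until_clean(position: int, max_t: int, lag: int) -> int:
    """Count the global steps needed before a token reaches timestep 0."""
    step = 0
    while token_timestep(step, position, max_t, lag) > 0:
        step += 1
    return step


if __name__ == "__main__":
    MAX_T, LAG = 10, 2
    for step in (0, 5, 10, 16):
        levels = [token_timestep(step, n, MAX_T, LAG) for n in range(4)]
        print(f"global step {step:2d}: per-token timesteps {levels}")
```

Under this toy schedule the leftmost token is fully denoised first, so later updates of the tokens to its right can condition on it.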
Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps
Li, Mingxiao; Qu, Tingyu; Yao, Ruicong; Sun, Wei; Moens, Marie-Francine
doi: 10.48550/arxiv.2305.15583
pmid: N/A
Abstract: Diffusion Probabilistic Models (DPM) have shown remarkable efficacy in the synthesis of high-quality images. However, their inference process characteristically requires numerous iterative steps, potentially hundreds, which could exacerbate the problem of exposure bias caused by the discrepancy between training and inference. Previous work has attempted to mitigate this issue by perturbing inputs during training, which consequently mandates retraining the DPM. In this work, we conduct a systematic study of exposure bias in DPM and, intriguingly, we find that the exposure bias can be alleviated with a novel sampling method that we propose, without retraining the model. We empirically and theoretically show that, during inference, for each backward time step $t$ and corresponding state $\hat{x}_t$, there might exist another time step $t_s$ which exhibits superior coupling with $\hat{x}_t$. Based on this finding, we introduce a sampling method named Time-Shift Sampler. Our framework can be seamlessly integrated into existing sampling algorithms, such as DDPM, DDIM and other high-order solvers, at minimal additional computational cost. Experimental results show that our method brings significant and consistent improvements in FID scores across different datasets and sampling methods. For example, integrating Time-Shift Sampler into F-PNDM yields an FID of 3.88, a 44.49% improvement over F-PNDM, on CIFAR-10 with 10 sampling steps, which outperforms vanilla DDIM with 100 sampling steps. We will release the code upon acceptance.
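Note: the search for a better-coupled time step $t_s$ can be sketched as a small lookup. The abstract does not state the selection criterion, so this sketch assumes one plausible choice: within a window around $t$, pick the step whose expected state variance is closest to the variance of the current state. The table `expected_var`, the window size, and the function name are all hypothetical.

```python
from statistics import pvariance
from typing import Mapping, Sequence


def shifted_time_step(
    state: Sequence[float],
    t: int,
    expected_var: Mapping[int, float],
    window: int,
) -> int:
    """Return the time step in [t - window, t + window] that best matches the state.

    `expected_var[s]` is the variance a state is expected to have at step s
    (a hypothetical precomputed table). The match criterion, closest variance,
    is an assumption made for this sketch.
    """
    observed = pvariance(state)
    candidates = [s for s in range(t - window, t + window + 1) if s in expected_var]
    return min(candidates, key=lambda s: abs(expected_var[s] - observed))


if __name__ == "__main__":
    # Toy table: variance grows linearly with the time step.
    table = {s: s / 10 for s in range(1, 11)}
    x_hat = [1.0, -1.0, 1.0, -1.0]  # toy state with variance 1.0
    print(shifted_time_step(x_hat, t=8, expected_var=table, window=3))
```

The sampler would then run its usual update with the returned step in place of $t$, which is why no retraining is involved.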
Architectural Vision for Quantum Computing in the Edge-Cloud Continuum
Furutanpey, Alireza; Barzen, Johanna; Bechtold, Marvin; Dustdar, Schahram; Leymann, Frank; Raith, Philipp; Truger, Felix
doi: 10.48550/arxiv.2305.05238
pmid: N/A
Abstract: Quantum processing units (QPUs) are currently available exclusively from cloud vendors. However, with recent advancements, hosting QPUs will soon be possible everywhere. Existing work has yet to draw from research in edge computing to explore systems exploiting mobile QPUs, or how hybrid applications can benefit from distributed heterogeneous resources. Hence, this work presents an architecture for quantum computing in the edge-cloud continuum. We discuss the necessity, challenges, and solution approaches for extending existing work on classical edge computing to integrate QPUs. We describe how warm-starting allows defining workflows that exploit the hierarchical resources spread across the continuum. Then, we introduce a distributed inference engine with hybrid classical-quantum neural networks (QNNs) to aid system designers in accommodating applications with complex requirements that incur the highest degree of heterogeneity. We propose solutions focusing on classical layer partitioning and quantum circuit cutting to demonstrate the potential of utilizing classical and quantum computation across the continuum. To evaluate the importance and feasibility of our vision, we provide a proof of concept that exemplifies how extending a classical partition method to integrate quantum circuits can improve the solution quality. Specifically, we implement a split neural network with optional hybrid QNN predictors. Our results show that extending classical methods with QNNs is viable and promising for future work.
Principal Uncertainty Quantification with Spatial Correlation for Image Restoration Problems
Belhasin, Omer; Romano, Yaniv; Freedman, Daniel; Rivlin, Ehud; Elad, Michael
doi: 10.48550/arxiv.2305.10124
pmid: N/A
Abstract: Uncertainty quantification for inverse problems in imaging has drawn much attention lately. Existing approaches towards this task define uncertainty regions based on probable values per pixel, while ignoring spatial correlations within the image, resulting in an exaggerated volume of uncertainty. In this paper, we propose PUQ (Principal Uncertainty Quantification), a novel definition and corresponding analysis of uncertainty regions that takes into account spatial relationships within the image, thus providing reduced volume regions. Using recent advancements in generative models, we derive uncertainty intervals around principal components of the empirical posterior distribution, forming an ambiguity region that guarantees the inclusion of true unseen values with a user-defined confidence probability. To improve computational efficiency and interpretability, we also guarantee the recovery of true unseen values using only a few principal directions, resulting in more informative uncertainty regions. Our approach is verified through experiments on image colorization, super-resolution, and inpainting; its effectiveness is shown through comparison to baseline methods, demonstrating significantly tighter uncertainty regions.
Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models
Wang, Ruichen; Chen, Zekang; Chen, Chen; Ma, Jian; Lu, Haonan; Lin, Xiaodong
doi: 10.48550/arxiv.2305.13921
pmid: N/A
Abstract: Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, they fail to semantically align the generated images with the prompts due to their limited compositional capabilities, leading to attribute leakage, entity leakage, and missing entities. In this paper, we propose a novel attention mask control strategy based on predicted object boxes to address these issues. In particular, we first train a BoxNet to predict a box for each entity that possesses the attribute specified in the prompt. Then, depending on the predicted boxes, a unique mask control is applied to the cross- and self-attention maps. Our approach produces a more semantically accurate synthesis by constraining the attention regions of each token in the prompt to the image. In addition, the proposed method is straightforward and effective and can be readily integrated into existing cross-attention-based T2I generators. We compare our approach to competing methods and demonstrate that it can faithfully convey the semantics of the original text to the generated content and achieve high availability as a ready-to-use plugin.
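Note: constraining a token's attention to its predicted box can be sketched on a single attention map. This is a minimal illustration, assuming a hard zero-outside-the-box mask followed by renormalisation; the abstract does not give the exact mask control, and `mask_attention` and the box layout are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def mask_attention(attn: Grid, box: Box) -> Grid:
    """Zero a token's attention outside its box, then renormalise to sum to 1.

    `attn` is one token's attention over image positions, laid out as rows
    and columns. If nothing inside the box has weight, the all-zero map is
    returned unchanged to avoid dividing by zero.
    """
    top, left, bottom, right = box
    masked = [
        [
            value if top <= r < bottom and left <= c < right else 0.0
            for c, value in enumerate(row)
        ]
        for r, row in enumerate(attn)
    ]
    total = sum(sum(row) for row in masked)
    if total == 0:
        return masked
    return [[value / total for value in row] for row in masked]


if __name__ == "__main__":
    uniform = [[0.25, 0.25], [0.25, 0.25]]
    # Keep only the top row of the 2x2 map.
    print(mask_attention(uniform, (0, 0, 1, 2)))
```

Applying one such mask per entity token keeps each attribute's attention inside the region predicted for its entity, which is the mechanism the abstract credits for reducing leakage.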
Gaussian process deconvolution
Tobar, Felipe; Robert, Arnaud; Silva, Jorge F.
doi: 10.1098/rspa.2022.0648
pmid: N/A
Abstract: Let us consider the deconvolution problem, that is, recovering a latent source $x(\cdot)$ from the observations $\mathbf{y} = [y_1,\ldots,y_N]$ of a convolution process $y = x\star h + \eta$, where $\eta$ is an additive noise, the observations in $\mathbf{y}$ might have missing parts with respect to $y$, and the filter $h$ could be unknown. We propose a novel strategy to address this task when $x$ is a continuous-time signal: we adopt a Gaussian process (GP) prior on the source $x$, which allows for closed-form Bayesian nonparametric deconvolution. We first analyse the direct model to establish the conditions under which the model is well defined. Then, we turn to the inverse problem, where we study i) some necessary conditions under which Bayesian deconvolution is feasible, and ii) to what extent the filter $h$ can be learnt from data or approximated for the blind deconvolution case. The proposed approach, termed Gaussian process deconvolution (GPDC), is compared to other deconvolution methods conceptually, via illustrative examples, and using real-world datasets.
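Note: the closed form mentioned above follows from standard Gaussian conditioning, because convolution is a linear operator and so $x$ and $y$ are jointly Gaussian. The sketch below assumes a zero-mean prior with covariance $K_x$, a known filter $h$, white noise of variance $\sigma^2$, and observation times $t_1,\ldots,t_N$; the symbols $K_x$, $K_f$, $K_{xf}$ and $\sigma^2$ are introduced here for illustration and do not appear in the abstract.

```latex
\begin{align}
y(t) &= \int x(\tau)\, h(t-\tau)\, d\tau + \eta(t),
  \qquad x \sim \mathcal{GP}(0, K_x), \quad \eta(t_i) \sim \mathcal{N}(0, \sigma^2) \\
K_f(t,t') &= \iint h(t-\tau)\, K_x(\tau,\tau')\, h(t'-\tau')\, d\tau\, d\tau' \\
K_{xf}(t,t') &= \int K_x(t,\tau')\, h(t'-\tau')\, d\tau' \\
\mathbb{E}[x(t) \mid \mathbf{y}] &=
  \mathbf{k}_{xf}(t)^{\top} \left(\mathbf{K}_f + \sigma^2 \mathbf{I}\right)^{-1} \mathbf{y} \\
\mathrm{Var}[x(t) \mid \mathbf{y}] &=
  K_x(t,t) - \mathbf{k}_{xf}(t)^{\top} \left(\mathbf{K}_f + \sigma^2 \mathbf{I}\right)^{-1} \mathbf{k}_{xf}(t)
\end{align}
```

Here $[\mathbf{k}_{xf}(t)]_i = K_{xf}(t, t_i)$ and $[\mathbf{K}_f]_{ij} = K_f(t_i, t_j)$. Only the observed times enter these expressions, so missing parts of $y$ need no special treatment.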
Scheduling Network Function Chains Under Sub-Millisecond Latency SLOs
Wang, Jianfeng; Gupta, Siddhant; Vieira, Marcos A. M.; Raghavan, Barath; Govindan, Ramesh
doi: 10.48550/arxiv.2305.01890
pmid: N/A
Abstract: Network Function Virtualization (NFV) seeks to replace hardware middleboxes with software-based Network Functions (NFs). NFV systems are seeing greater deployment in the cloud and at the edge. However, especially at the edge, there is a mismatch between the traditional focus on NFV throughput and the need to meet very low latency SLOs, as edge services inherently require low latency. Moreover, cloud-based NFV systems need to achieve such low latency while minimizing CPU core usage. We find that real-world traffic exhibits burstiness that causes latency spikes of up to tens of milliseconds in existing NFV systems. To address this, we built NetBlaze, which achieves sub-millisecond p99 latency SLOs, even for adversarial traffic, using a novel multi-scale core-scaling strategy. NetBlaze makes traffic-to-core allocation decisions at rack, server, and core-spatial scales, and at increasingly finer timescales, to accommodate multi-timescale bursts. In comparison with state-of-the-art approaches, NetBlaze is the only one capable of achieving sub-millisecond p99 latency SLOs while using a comparable number of cores.