Publications
You can also find my articles on my Google Scholar profile.
* Indicates equal contribution.
In the Pipeline
- Asiaee*, A., Abrams*, Z. B., Nakayiza, S., Sampath, D., & Coombes, K. R. (2019). Identification and comparison of genes differentially regulated by transcription factors and miRNAs.
@article{aans19a, title = {Identification and comparison of genes differentially regulated by transcription factors and {miRNAs}}, author = {Asiaee*, Amir and Abrams*, Zachary B and Nakayiza, Samantha and Sampath, Deepa and Coombes, Kevin R}, archiveprefix = {bioRxiv}, eprint = {803643}, note = {https://doi.org/10.1101/803643}, year = {2019} }
Transcription factors and microRNAs (miRNAs) both play a critical role in gene regulation and in the development of many diseases such as cancer. Understanding how transcription factors and miRNAs influence gene expression is thus important, but complicated by the large and interconnected nature of the human genome. To better understand which genes are regulated by transcription factors and/or miRNAs, we looked at over 8000 patient samples from 32 different cancer types collected from The Cancer Genome Atlas (TCGA). We started by clustering the transcription factors and miRNAs using Thresher to reduce the number of features. Using both the mRNA and miRNA sequencing data, we constructed linear models to calculate the coefficient of determination (R²) for each mRNA based on the Thresher cluster expression. We generated three types of linear models: transcription factor, miRNA, and transcription factor plus miRNA. We then determined genes that were highly explained or poorly explained by each of the three models based on each gene's R² value. We performed downstream gene enrichment analysis using ToppGene on the sets of well and poorly explained genes. This identified differences in gene regulation between transcription factors and miRNAs and showed which groups of genes are differentially regulated.
- Asiaee, A., Oymak, S., Coombes, K. R., & Banerjee, A. High Dimensional Data Enrichment: Interpretable, Fast, and Data-Efficient. Under Review In.
@article{aobc18, title = {High Dimensional Data Enrichment: Interpretable, Fast, and {Data-Efficient}}, author = {Asiaee, Amir and Oymak, Samet and Coombes, Kevin R and Banerjee, Arindam}, journal = {Under review in}, archiveprefix = {arXiv}, primaryclass = {stat.ML}, eprint = {1806.04047}, note = {https://arxiv.org/abs/1806.04047} }
The high dimensional structured data enriched model describes groups of observations by shared and per-group individual parameters, each with its own structure such as sparsity or group sparsity. In this paper, we consider the general form of data enrichment where data comes in a fixed but arbitrary number of groups G. Any convex function, e.g., a norm, can characterize the structure of both shared and individual parameters. We propose an estimator for the high dimensional data enriched model and provide conditions under which it consistently estimates both shared and individual parameters. We also delineate the sample complexity of the estimator and present a high probability non-asymptotic bound on the estimation error of all parameters. Interestingly, the sample complexity of our estimator translates to conditions on both per-group sample sizes and the total number of samples. We propose an iterative estimation algorithm with a linear convergence rate and supplement our theoretical analysis with synthetic and real experimental results. In particular, we show the predictive power of the data enriched model along with its interpretable results in anticancer drug sensitivity analysis.
- Abrams, Z. B., Joglekar, A., Gershkowitz, G. R., Sinicropiyao, S., Asiaee, A., Carbone, D. P., & Coombes, K. R. Personalized Transcriptomics: Selecting Drugs Based on Gene Expression Profiles. Under Review.
@article{ajgs20, title = {Personalized Transcriptomics: Selecting Drugs Based on Gene Expression Profiles}, author = {Abrams, Zachary B. and Joglekar, Anoushka and Gershkowitz, Gregory R. and Sinicropiyao, Sara and Asiaee, Amir and Carbone, David P. and Coombes, Kevin R.}, journal = {Under review} }
Applications in Biology
- Asiaee*, A., Abrams*, Z. B., Nakayiza, S., Sampath, D., & Coombes, K. R. (2020). Explaining Gene Expression Using Twenty-One MicroRNAs. Journal of Computational Biology, Forthcoming.
@article{aans20, title = {Explaining Gene Expression Using Twenty-One MicroRNAs}, author = {Asiaee*, Amir and Abrams*, Zachary B. and Nakayiza, Samantha and Sampath, Deepa and Coombes, Kevin R.}, journal = {Journal of computational biology}, volume = {Forthcoming}, year = {2020} }
The transcriptome, or gene expression profile, of a tumor contains detailed information about the disease. Although advances in sequencing technologies have generated larger and more informative data sets, there are still many questions about how the transcriptome is regulated. One class of regulatory elements consists of microRNAs (or miRs), many of which are known to be associated with cancer. To better understand the relationships between microRNAs and different cancers, we analyzed 9000 samples from 32 cancer types studied in The Cancer Genome Atlas (TCGA). Using the Thresher R package to perform feature reduction, we found evidence for 21 biologically interpretable clusters of miRs. Many of these clusters were statistically associated with a specific type of cancer. Moreover, the clusters contain sufficient information to distinguish between most types of cancer. We then used linear models to measure, genome-wide, how much variation in gene expression could be explained by the 21 average expression values (“score”) of the clusters. Based on the 20,000 per-gene R² values, we found that (a) mean differences between cancer types explain about 40% of variation; (b) the 21 miR cluster scores explain about 30% of variation; and (c) combining cancer type with the miR scores explained about 56% of the total genome-wide variation in gene expression. Our analysis of poorly explained genes shows that they are enriched for olfactory receptor processes, sensory perception, and nervous system processing, which are necessary to receive and interpret signals from outside the organism. Therefore, it is reasonable for those genes to be always active and not get down-regulated by miRs. In contrast, highly explained genes are characterized by genes translating to proteins necessary for transport, plasma membrane, or metabolic processes that are heavily regulated processes inside the cell. The distribution of R² values suggests that other genetic regulatory elements such as transcription factors (TF) and methylation would help explain some of the remaining variation in gene expression. By building a combined microRNA-TF-methylation model, we can potentially predict the majority of human transcriptomic expression.
- Asiaee*, A., Abrams*, Z. B., Nakayiza, S., Sampath, D., & Coombes, K. R. (2019). Explaining Gene Expression Using Twenty-One MicroRNAs. Computational Biology Workshop at ICML 2019.
@inproceedings{aans19, title = {Explaining Gene Expression Using Twenty-One MicroRNAs}, author = {Asiaee*, Amir and Abrams*, Zachary B. and Nakayiza, Samantha and Sampath, Deepa and Coombes, Kevin R.}, booktitle = {Computational Biology Workshop at ICML 2019}, year = {2019} }
The transcriptome, or gene expression profile, of a tumor contains detailed information about the disease. Although advances in sequencing technologies have generated larger and more informative data sets, there are still many questions about how the transcriptome is regulated. One class of regulatory elements consists of microRNAs (or miRs), many of which are known to be associated with cancer. To better understand the relationships between microRNAs and different cancers, we analyzed ∼9000 samples from 32 cancer types studied in The Cancer Genome Atlas (TCGA). Using the Thresher R package to perform feature reduction, we found evidence for 21 biologically interpretable clusters of miRs. Many of these clusters were statistically associated with a specific type of cancer. Moreover, the clusters contain sufficient information to distinguish between most types of cancer. We then used linear models to measure, genome-wide, how much variation in gene expression could be explained by the 21 average expression values (“score”) of the clusters. Based on the 20,000 per-gene R² values, we found that (a) mean differences between cancer types explain about 40% of variation; (b) the 21 miR cluster scores explain about 30% of variation; and (c) combining cancer type with the miR scores explained about 56% of the total genome-wide variation in gene expression. Our analysis of poorly explained genes shows that they are enriched for olfactory receptor processes, sensory perception, and nervous system processing, which are necessary to receive and interpret signals from outside the organism. Therefore, it is reasonable for those genes to be always active and not get down-regulated by miRs. In contrast, highly explained genes are characterized by genes translating to proteins necessary for transport, plasma membrane, or metabolic processes that are heavily regulated processes inside the cell. The distribution of R² values suggests that other genetic regulatory elements such as transcription factors (TF) and methylation would help explain some of the remaining variation in gene expression. By building a combined microRNA-TF-methylation model, we can potentially predict the majority of human transcriptomic expression.
- Cho*, M. H., Asiaee*, A., & Kurtek, S. (2019). Elastic Statistical Shape Analysis of Biological Structures with Case Studies: A Tutorial. Bulletin of Mathematical Biology, 81(7), 2052–2073.
@article{cak18, title = {Elastic Statistical Shape Analysis of Biological Structures with Case Studies: A Tutorial}, author = {Cho, Min Ho* and Asiaee*, Amir and Kurtek, Sebastian}, journal = {Bulletin of mathematical biology}, volume = {81}, number = {7}, pages = {2052--2073}, year = {2019}, note = {https://dx.doi.org/10.1007/s11538-019-00609-w} }
We describe a recent framework for statistical shape analysis of curves and show its applicability to various biological datasets. The presented methods are based on a functional representation of shape called the square-root velocity function and a closely related elastic metric. The main benefit of this approach is its invariance to reparameterization (in addition to the standard shape-preserving transformations of translation, rotation and scale), and ability to compute optimal registrations (point correspondences) across objects. Building upon the defined distance between shapes, we additionally describe tools for computing sample statistics including the mean and covariance. Based on the covariance structure, one can also explore variability in shape samples via principal component analysis. Finally, the estimated mean and covariance can be used to define Wrapped Gaussian models on the shape space, which are easy to sample from. We present multiple case studies on various biological datasets including (1) leaf outlines, (2) internal carotid arteries, (3) Diffusion Tensor Magnetic Resonance Imaging fiber tracts, (4) Glioblastoma Multiforme tumors, and (5) vertebrae in mice. We additionally provide a MATLAB package that can be used to produce the results given in this manuscript.
- Abrams, Z. B., Zucker, M., Wang, M., Asiaee Taheri, A., Abruzzo, L. V., & Coombes, K. R. (2018). Thirty biologically interpretable clusters of transcription factors distinguish cancer type. BMC Genomics, 19(1), 738.
@article{azwa18, title = {Thirty biologically interpretable clusters of transcription factors distinguish cancer type}, author = {Abrams, Zachary B and Zucker, Mark and Wang, Min and Asiaee Taheri, Amir and Abruzzo, Lynne V and Coombes, Kevin R}, journal = {BMC genomics}, volume = {19}, number = {1}, pages = {738}, note = {https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-5093-z}, year = {2018} }
BACKGROUND: Transcription factors are essential regulators of gene expression and play critical roles in development, differentiation, and in many cancers. To carry out their regulatory programs, they must cooperate in networks and bind simultaneously to sites in promoter or enhancer regions of genes. We hypothesize that the mRNA co-expression patterns of transcription factors can be used both to learn how they cooperate in networks and to distinguish between cancer types. RESULTS: We recently developed a new algorithm, Thresher, that combines principal component analysis, outlier filtering, and von Mises-Fisher mixture models to cluster genes (in this case, transcription factors) based on expression, determining the optimal number of clusters in the process. We applied Thresher to the RNA-Seq expression data of 486 transcription factors from more than 10,000 samples of 33 kinds of cancer studied in The Cancer Genome Atlas (TCGA). We found that 30 clusters of transcription factors from a 29-dimensional principal component space were able to distinguish between most cancer types, and could separate tumor samples from normal controls. Moreover, each cluster of transcription factors could be either (i) linked to a tissue-specific expression pattern or (ii) associated with a fundamental biological process such as cell cycle, angiogenesis, apoptosis, or cytoskeleton. Clusters of the second type were more likely also to be associated with embryonically lethal mouse phenotypes. CONCLUSIONS: Using our approach, we have shown that the mRNA expression patterns of transcription factors contain most of the information needed to distinguish different cancer types. The Thresher method is capable of discovering biologically interpretable clusters of genes. It can potentially be applied to other gene sets, such as signaling pathways, to decompose them into simpler, yet biologically meaningful, components.
High Dimensional Statistics
- Asiaee, A., Oymak, S., Coombes, K. R., & Banerjee, A. (2019). Data Enrichment: Multi-task Learning in High Dimension with Theoretical Guarantees. Adaptive and Multi-Task Learning Workshop at ICML 2019.
@inproceedings{aocb19, title = {Data Enrichment: Multi-task Learning in High Dimension with Theoretical Guarantees}, author = {Asiaee, Amir and Oymak, Samet and Coombes, Kevin R. and Banerjee, Arindam}, booktitle = {Adaptive and Multi-Task Learning Workshop at ICML 2019}, year = {2019} }
Given samples from a group of related regression tasks, a data-enriched model describes observations by common and per-group individual parameters. In the high-dimensional regime, each parameter has its own structure such as sparsity or group sparsity. In this paper, we consider the general form of data enrichment where data comes in a fixed but arbitrary number of tasks G and any convex function, e.g., a norm, can characterize the structure of both common and individual parameters. We propose an estimator for the high-dimensional data enriched model and investigate its statistical properties. We delineate the sample complexity of our estimator and provide a high probability non-asymptotic bound for the estimation error of all parameters under a condition weaker than the state-of-the-art. We propose an iterative estimation algorithm with a geometric convergence rate. Overall, we present a first thorough statistical and computational analysis of inference in the data enriched model.
- Asiaee T., A., Chaterjee, S., & Banerjee, A. (2016). High Dimensional Structured Estimation with Noisy Designs. 16th SIAM International Conference on Data Mining (SDM), 801–809.
@inproceedings{ascb16, title = {High Dimensional Structured Estimation with Noisy Designs}, author = {Asiaee T., Amir and Chaterjee, Soumyadeep and Banerjee, Arindam}, booktitle = {16th SIAM International Conference on Data Mining (SDM)}, pages = {801--809}, year = {2016}, organization = {SIAM} }
Structured estimation methods, such as LASSO, have received considerable attention in recent years, and substantial progress has been made in extending such methods to general norms and non-Gaussian design matrices. In real world problems, however, covariates are usually corrupted with noise, and there have been efforts to generalize structured estimation methods to the noisy covariate setting. In this paper we first show that without any information about the noise in covariates, currently established techniques for bounding the statistical error of estimation fail to provide consistency guarantees. However, when information about the noise covariance is available or can be estimated, we prove consistency guarantees for any norm regularizer, which is a more general result than the state of the art. Next, we investigate the empirical performance of structured estimation, specifically LASSO, when covariates are noisy and empirically show that LASSO is not consistent or stable in the presence of additive noise. However, prediction performance improves quite substantially when the noise covariance is available for incorporation in the estimator.
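The effect of covariate noise described in the abstract above can be illustrated with a minimal sketch. It is not the estimator from the paper: it assumes a single covariate, ordinary least squares instead of LASSO, and a known noise variance, and the names `slopes`, `noise_var`, and the simulated data are hypothetical. It shows only the general idea that the naive slope is biased toward zero and that subtracting the known noise variance from the observed covariate variance removes that bias.

```python
import random


def slopes(z, y, noise_var):
    """Return (naive, corrected) regression slopes of y on the noisy covariate z.

    Assumption for this sketch: z = x + w, where w is independent noise with
    known variance noise_var. The corrected slope subtracts noise_var from
    the sample variance of z.
    """
    n = len(z)
    mean_z = sum(z) / n
    mean_y = sum(y) / n
    var_z = sum((a - mean_z) ** 2 for a in z) / n
    cov_zy = sum((a - mean_z) * (b - mean_y) for a, b in zip(z, y)) / n
    naive = cov_zy / var_z
    corrected = cov_zy / (var_z - noise_var)
    return naive, corrected


# Simulated data (all values here are illustrative assumptions).
rng = random.Random(0)
beta, noise_var, n = 2.0, 0.5, 20000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]                # clean covariate
z = [xi + rng.gauss(0.0, noise_var ** 0.5) for xi in x]    # observed, noisy covariate
y = [beta * xi + rng.gauss(0.0, 0.1) for xi in x]          # response from the clean covariate

naive, corrected = slopes(z, y, noise_var)
print(f"true slope {beta}, naive {naive:.3f}, corrected {corrected:.3f}")
```

In this setup the naive slope concentrates near beta / (1 + noise_var) rather than beta, while the corrected slope concentrates near beta.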
Social Network Analysis
- Golnari*, G., Asiaee T.*, A., Banerjee, A., & Zhang, Z.-L. (2015). Revisiting Non-Progressive Influence Models: Scalable Influence Maximization in Social Networks. 31st Conference on Uncertainty in Artificial Intelligence (UAI), 316–325.
@inproceedings{gabz15, title = {Revisiting Non-Progressive Influence Models: Scalable Influence Maximization in Social Networks.}, author = {Golnari*, Golshan and Asiaee T.*, Amir and Banerjee, Arindam and Zhang, Zhi-Li}, booktitle = {31st Conference on Uncertainty in Artificial Intelligence (UAI)}, pages = {316--325}, year = {2015} }
Influence maximization in social networks has been studied extensively in the computer science community for the last decade. However, almost all of the efforts have been focused on the progressive influence models, such as the independent cascade (IC) and linear threshold (LT) models, which cannot capture the \textit{reversibility} of choices. In this paper, we present the Heat Conduction (HC) model, which is a \textit{non-progressive} influence model and has favorable real-world interpretations. Moreover, we show that HC unifies, generalizes, and extends the existing non-progressive models, such as the Voter model \cite{even-dar_note_2007} and non-progressive LT \cite{kempe_maximizing_2003}. We then prove that selecting the optimal seed set of influential nodes is NP-hard for HC, but by establishing the submodularity of influence spread, we can tackle the influence maximization problem with a scalable and provably near-optimal greedy algorithm. To the best of our knowledge, we are the first to present a scalable solution for influence maximization under the non-progressive LT model, as a special case of the HC model. In sharp contrast to the other greedy influence maximization methods, our fast and efficient C2Greedy algorithm benefits from two analytically computable steps: closed-form computation for finding the influence spread as well as the greedy seed selection. Through extensive experiments on several large real and synthetic networks, we show that C2Greedy outperforms the state-of-the-art methods, under the HC model, in terms of both influence spread and scalability.
- Asiaee T., A., Afshar, M., & Asadpour, M. (2013). Influence maximization for informed agents in collective behavior. In Distributed Autonomous Robotic Systems (pp. 389–402). Springer.
@incollection{asaa13, title = {Influence maximization for informed agents in collective behavior}, author = {Asiaee T., Amir and Afshar, Mohammad and Asadpour, Masoud}, booktitle = {Distributed Autonomous Robotic Systems}, pages = {389--402}, year = {2013}, publisher = {Springer} }
Control of collective behavior is an active topic in biology, social, and computer science. In this work we investigate how a minority of informed agents can influence and control the whole society through local interactions. The problem we specifically target is that a minority of people with a bounded budget for initiating new social relations attempt to control the collective behavior of a society and move the crowd toward a specific goal. Assuming that local interactions can only take place between friends, the minority has to initiate some new relations with the majority. The total cost of new relations is limited to a budget. The problem is then finding the optimal links in order to gain maximum impact on the society. We will model the problem as a diffusion process in a social network. The proof of NP-hardness of the problem for Local Interaction Game model of diffusion is presented. Simulations show that the proposed method surpasses the popular strategies based on degree and distance centrality in performance.
- Asiaee T., A., Tepper, M., Banerjee, A., & Sapiro, G. (2012). If you are happy and you know it... tweet. 21st ACM International Conference on Information and Knowledge Management (CIKM), 1602–1606.
@inproceedings{atbs12, title = {If you are happy and you know it... tweet}, author = {Asiaee T., Amir and Tepper, Mariano and Banerjee, Arindam and Sapiro, Guillermo}, booktitle = {21st ACM international conference on Information and knowledge management (CIKM)}, pages = {1602--1606}, year = {2012}, organization = {ACM} }
Extracting sentiment from Twitter data is one of the fundamental problems in social media analytics. Twitter’s length constraint renders determining the positive/negative sentiment of a tweet difficult, even for a human judge. In this work we present a general framework for per-tweet (in contrast with batches of tweets) sentiment analysis which consists of: (1) extracting tweets about a desired target subject, (2) separating tweets with sentiment, and (3) setting apart positive from negative tweets. For each step, we study the performance of a number of classical and new machine learning algorithms. We also show that the intrinsic sparsity of tweets allows performing classification in a low dimensional space, via random projections, without losing accuracy. In addition, we present weighted variants of all employed algorithms, exploiting the available labeling uncertainty, which further improve classification accuracy. Finally, we show that spatially aggregating our per-tweet classification results produces a very satisfactory outcome, making our approach a good candidate for batch tweet sentiment analysis.
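The random projection step mentioned in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline from the paper: it assumes a dense Gaussian projection matrix, synthetic sparse bag-of-words vectors stored as `{index: value}` dictionaries, and hypothetical sizes `d`, `k`, and `n_tweets`. It checks only that pairwise distances are roughly preserved after projection, which is the usual argument for why classifiers can work in the lower-dimensional space; it does not measure classification accuracy.

```python
import math
import random


def random_projection_matrix(k, d, rng):
    """Build a k x d Gaussian matrix scaled so squared norms are preserved in expectation."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, sparse_vec):
    """Project a sparse vector {index: value}, touching only its nonzero entries."""
    return [sum(row[i] * v for i, v in sparse_vec.items()) for row in matrix]


def sparse_distance(u, v):
    """Euclidean distance between two sparse vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(i, 0.0) - v.get(i, 0.0)) ** 2 for i in keys))


def dense_distance(a, b):
    """Euclidean distance between two dense vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Illustrative sizes (assumptions): vocabulary d, projected dimension k.
rng = random.Random(0)
d, k, n_tweets, words_per_tweet = 2000, 200, 10, 10

# Each synthetic "tweet" has a handful of nonzero word indicators.
tweets = [{i: 1.0 for i in rng.sample(range(d), words_per_tweet)} for _ in range(n_tweets)]

matrix = random_projection_matrix(k, d, rng)
projected = [project(matrix, t) for t in tweets]

# Ratio of projected distance to original distance for every pair of tweets.
ratios = [
    dense_distance(projected[i], projected[j]) / sparse_distance(tweets[i], tweets[j])
    for i in range(n_tweets)
    for j in range(i + 1, n_tweets)
]
print(f"distance ratios range from {min(ratios):.2f} to {max(ratios):.2f}")
```

Because each vector has only a few nonzero entries, the projection cost per tweet scales with the number of words in the tweet rather than with the vocabulary size.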
General Machine Learning
- Asiaee T., A., Goel, H., Gosh, S., Yegneswaran, V., & Banerjee, A. (2018). Time Series Deinterleaving of DNS Traffic. 1st Deep Learning and Security Workshop (DLS).
@inproceedings{aggy18, title = {Time Series Deinterleaving of DNS Traffic}, author = {Asiaee T., Amir and Goel, Hardik and Gosh, Shalini and Yegneswaran, Vinod and Banerjee, Arindam}, booktitle = {1st Deep Learning and Security Workshop (DLS)}, year = {2018} }
Stream deinterleaving is an important problem with various applications in the cybersecurity domain. In this paper, we consider the specific problem of deinterleaving DNS data streams using machine-learning techniques, with the objective of automating the extraction of malware domain sequences. We first develop a generative model for user request generation and DNS stream interleaving. Based on these we evaluate various inference strategies for deinterleaving including augmented HMMs and LSTMs on synthetic datasets. Our results demonstrate that state-of-the-art LSTMs outperform more traditional augmented HMMs in this application domain.